Abstract

Background: A common problem in phylogenetic analysis is to identify frequent patterns in a collection of 
phylogenetic trees. The goal is, roughly, to find a subset of the species (taxa) on which all or some significant subset of 
the trees agree. One popular method to do so is through maximum agreement subtrees (MASTs). MASTs are also
used, among other things, as a metric for comparing phylogenetic trees, for computing congruence indices, and for
identifying horizontal gene transfer events.

Results: We give algorithms and experimental results for two approaches to identify common patterns in a 
collection of phylogenetic trees, one based on agreement subtrees, called maximal agreement subtrees, the other on 
frequent subtrees, called maximal frequent subtrees. These approaches can return subtrees on larger sets of taxa than 
MASTs, and can reveal new common phylogenetic relationships not present in either MASTs or the majority rule tree 
(a popular consensus method). Our current implementation is available on the web at
https://code.google.com/p/mfst-miner/.

Conclusions: Our computational results confirm that maximal agreement subtrees and all maximal frequent 
subtrees can reveal a more complete phylogenetic picture of the common patterns in collections of phylogenetic 
trees than maximum agreement subtrees; they are also often more resolved than the majority rule tree. Further, our 
experiments show that enumerating maximal frequent subtrees is considerably more practical than enumerating 
ordinary (not necessarily maximal) frequent subtrees. 

Keywords: Phylogenetic trees, Evolutionary trees, Maximum agreement subtree, Frequent subtrees, Maximal 
frequent subtrees, Reverse search 



Background 

A phylogenetic tree is an unordered rooted tree whose 
leaves are in one-to-one correspondence with a set of 
species (also referred to as taxa); its topology represents 
the hypothetical evolutionary relationships among these 
species. 

An agreement subtree (AST) for a collection of phylo- 
genetic trees on a common leaf set is a minimal subtree 
connecting a fixed set of leaves that is homeomorphically 
included in all of the input trees. A maximal agreement 
subtree (MXST) is an agreement subtree that is not a 
subtree of any other agreement subtree. An MXST is a 
maximum agreement subtree (MAST) if it has the largest 
number of leaves [1]. MASTs are used, among other
things, as a metric for comparing phylogenetic trees [2-4],
for computing their congruence index [5,6], for identifying
horizontal gene transfer events [7], for resolving ambiguity in
terraces in phylogenetic tree space [8], and as a consensus
approach [9].

An MXST can reveal shared phylogenetic information 
not displayed by any of the MASTs (see Figure 1). We can 
uncover even more common substructure by relaxing the 
requirement that the subtree returned must be supported 
by all the input trees. Let f be a number in the interval
(1/2, 1]. An f-frequent subtree, or a frequent subtree (FST)
for short, in a collection of m leaf-labeled trees on a common
leaf set, is a minimal subtree connecting a fixed set of
leaves that is homeomorphically included in at least f · m
of the input trees. A maximal FST (MFST) is an FST that
is not a subtree of any other FST. We choose f greater
Figure 1 Motivating example 1. (a) A collection of two trees and their (b) MAST. (c) An MXST that has fewer leaves than the MAST but is not
displayed by it.



than 1/2, because (i) it conveys confidence that a majority of
the input trees support the f-frequent subtree, and (ii) it
ensures uniqueness: on a given set of leaves, there can be
at most one f-frequent subtree. Observe that an MXST is
an MFST with f = 1.

The set of all MFSTs is a compact non-redundant sum- 
mary of the set of all FSTs: every FST is a subtree of 
one or more MFSTs but every MFST is a subtree of only 
itself. Thus, every MFST reveals some unique phyloge- 
netic information that is not displayed by any other MFST 
(or FST). 

Also, since there can be exponentially more FSTs than 
MFSTs, mining MFSTs can be much faster than mining 
all FSTs, and the result set produced is much smaller and 
easier to analyze. 

A well-supported MFST can have more leaves and be 
more resolved than a MAST (see "Results and discussion" 
on page 14), and thus can reveal phylogenetic information 
not displayed by any of the MASTs. In the more general 
setting where there is little overlap among the leaf sets of 
the input trees, the gap between the size of an MFST and 
the size of a MAST can be even wider. Indeed, in this case 
any agreement tree would tend to be quite small — see 
Figure 2. 

Despite its potential utility, however, the enumeration of 
all MFSTs in collections of phylogenetic trees has not, to 
our knowledge, been studied before. 

Here we introduce MfstMiner, an algorithm for enumerating
MXSTs and MFSTs. MfstMiner enumerates



MFSTs over partially overlapping leaf sets as well. We 
compare MfstMiner with EvoMiner [10], an algorithm 
for enumerating all FSTs, and show that enumerating 
MFSTs can be orders of magnitude faster than enumerating
all FSTs. Our current implementation of MfstMiner,
which works for up to 250 leaves and 10000
trees, can be downloaded from
https://code.google.com/p/mfst-miner/.

Related work 

The MAST problem was first studied by Finden and 
Gordon [1]. Since then, due to its utility and inherent complexity,
the problem has attracted computational biologists
and mathematicians alike. The MAST of two trees
can be found in polynomial time; indeed, over the years 
researchers have developed progressively faster algo- 
rithms for the problem [11-13]. Finding the MAST of 
more than two trees is NP-hard in general [11], but is 
solvable in polynomial time for trees of bounded degree 
[14,15]. 

Although maximal subgraph mining [16,17] and, in par- 
ticular, maximal subtree mining [18-20] have received 
much attention in the data mining literature, a different 
approach is needed for mining phylogenetic trees. This is 
because phylogenetic trees possess a special structure — 
only leaves are labeled and the non-leaf nodes must be of 
degree two or more — that affects the very definition of a 
subtree [10,21]. We defer the formal definitions of phylo- 
genetic trees and subtrees to the Preliminaries, on page 4. 
Figure 2 Motivating example 2. (a) A collection of three trees and (b) an MFST with f = 2/3. MAST or MRT cannot be applied, as the common
overlap consists of only two leaves.
Zhang et al. [21] were the first to study frequent phylogenetic
subtree mining. They proposed an algorithm,
Phylominer, to mine all frequent subtrees in a collection
of phylogenetic trees in time quadratic in the size of the
output set. In [10] we proposed a new algorithm, EvoMiner,
for the same task that achieves speed-ups of up
to 100 times or more over Phylominer. Both Phylominer
and EvoMiner follow an Apriori-like framework [22].
EvoMiner's increased speed is the result of an efficient
phylogenetic tree-specific constant-time candidate gener- 
ation scheme in the candidate generation step, and a novel 
fingerprinting based scheme for the downward-closure 
operation in the frequency counting step. 

Ramu et al. [23] proposed a heuristic for enumerating a 
subset of all MFSTs called maximum frequent subtrees. A 
maximum frequent subtree is an FST that has the max- 
imum number of leaves among all FSTs. Their method 
scales well for large phylogenetic datasets, but does not 
guarantee the enumeration of all MFSTs. To our knowl- 
edge, our work is the first to deal with the problem of 
mining all MFSTs for phylogenetic trees. 

Consensus methods are an oft-used alternative to fre- 
quent or agreement subtree methods, for summarizing 
the common information in collections of phylogenetic 
trees. Among the most popular consensus methods is the 
majority-rule tree (MRT) [24], defined as follows.

A cluster in a tree is the set of all leaf descendants of 
some node in the tree. The MRT of a collection of trees 
is the tree that exhibits all clusters present in the majority
(i.e., strictly more than 50%) of the input trees.
(Note the parallels between the use of majority clusters
and the choice of $f \in (1/2, 1]$ for MFSTs.) The MRT,
though linear-time computable, is very sensitive to the 
presence of "rogue" taxa; that is, taxa whose positions 
vary widely within the input collection [25,26]. MFSTs are 
less sensitive to this phenomenon, because the MRT by 
definition must contain the entire leaf set (including the 
rogue taxa), whereas MFSTs have no such restriction (see 
Figure 3). The fact that MASTs are less sensitive to rogue 



taxa than MRTs has been well-acknowledged in the litera- 
ture [25,27,28]. MFSTs, which include MASTs as a special 
case, are even more likely to reveal informative common 
substructures in the presence of rogue taxa. 
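
To make the cluster-based definition of the MRT concrete, the following Python sketch collects the clusters of each input tree and keeps those present in strictly more than half of the trees. It rests on assumptions that are not part of the paper: a leaf is represented by its integer label, an internal node by a tuple of its children, and the function names `clusters` and `majority_clusters` are hypothetical. The sketch only lists the majority clusters; it does not assemble them into the MRT itself.

```python
from collections import Counter

def clusters(tree):
    """Return the clusters of a nested-tuple tree as a set of frozensets."""
    found = set()

    def visit(node):
        # A leaf contributes the singleton cluster containing its own label.
        if isinstance(node, int):
            below = frozenset([node])
        else:
            below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found

def majority_clusters(trees):
    """Clusters present in strictly more than 50% of the input trees."""
    counts = Counter(c for t in trees for c in clusters(t))
    return {c for c, n in counts.items() if 2 * n > len(trees)}

# Three hypothetical input trees on the leaf set {1, 2, 3, 4}.
trees = [(((1, 2), 3), 4), (((1, 2), 4), 3), ((1, (2, 3)), 4)]
for cluster in sorted(majority_clusters(trees), key=lambda c: (len(c), sorted(c))):
    print(sorted(cluster))
```

Comparing `2 * n` with the number of trees keeps the majority test in integer arithmetic, so the "strictly more than 50%" condition is not affected by rounding.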

Phylogenetic networks represent evolutionary relationships
among taxa via directed graphs. In addition to tree
nodes (nodes with only one parent), they allow hybrid
nodes (nodes with two parents). Thus, phylogenetic networks
are more expressive than phylogenetic trees [29,30].
In the same way that agreement trees and majority rule
trees extract consensus information in phylogenetic trees,
consensus networks [31] represent frequent patterns in
phylogenetic networks [32].

Preliminaries 

A phylogenetic tree (or, for brevity, simply a tree) is an 
unordered rooted leaf-labeled tree. Leaf labels represent 
the taxonomic units (species or taxa) under study. An iso- 
morphism between phylogenetic trees includes the labels 
of the leaves. Phylogenetic trees can also be unrooted 
[33], but here we deal exclusively with rooted phylogenetic 
trees. A node is internal if it is not a leaf node. Each internal
node must have at least two children. Let $\mathcal{L}_T$ denote
the leaf label set of tree T, and $\psi_T$ denote the bijection
that maps the leaf nodes to their unique labels. For convenience,
we refer to the set of leaf nodes by their labels in
$\mathcal{L}_T$. From this point forward, unless the context requires
making a distinction, we will drop the subscripts in $\mathcal{L}_T$
and $\psi_T$ and write $\mathcal{L}$ and $\psi$ respectively. For the rest of the
paper, we assume without loss of generality that the leaf
label set $\mathcal{L}$ consists of distinct integers in the range $[1, |\mathcal{L}|]$;
thus, the labels are ordered. We denote the fact that two
trees $T_1$ and $T_2$ are isomorphic by writing $T_1 = T_2$.

Let T be a tree. Suppose u is an internal non-root node 
in T, such that u has only one child v. Then, suppressing u 
means contracting the edge (u, v); i.e., deleting u and the 
two edges incident on it, and then adding an edge from the 
parent of u to v. For example, in Figure 4(a), to suppress u, 
it is deleted and an edge is added from t to v. To prune a 
Figure 3 Motivating example 3. (a) Three input trees. (b) Their MAST, which is star-like. (c) The majority rule tree, which is also star-like. (d) Two
MFSTs with f = 2/3, each highly resolved and larger than the MAST.



Deepak and Fernandez-Baca Algorithms for Molecular Biology 2014, 9:16
http://www.almob.org/content/9/1/16







leaf ℓ, we first delete it. Let u be ℓ's parent. If u is not the
root, and the deletion of ℓ makes u a degree-two node, we
suppress u (see Figure 4(b)). If u is the root and deleting ℓ
makes it a degree-one node, u is deleted and its remaining
child becomes the new root (see Figure 4(c)). Otherwise,
u remains as it is (see Figure 4(d)). Grafting is the reverse
of pruning a leaf. Consider a leaf $\ell \notin \mathcal{L}_T$. To graft ℓ in
T, we first select a node or an edge in T. If the selection
is a non-root node u, we make ℓ a child of u in T (see
Figure 4(e)). We call this grafting of ℓ on node u. If the
selection is the root node r, we have two options: (i) graft ℓ on
r as if r were a non-root node, or (ii) create a new node r' and
make r and ℓ children of r'. In case (ii), r' becomes the
new root in T (see Figure 4(f)); we call this grafting ℓ on
top of root node r. If the selection is an edge (u, v), where u
is the parent of v, we delete edge (u, v), create a new node
u', make u' a child of u, and make ℓ and v children of u' (see
Figure 4(g)). We call this grafting of ℓ on edge (u, v). Let
T' denote the resulting tree. Then, clearly, in each of the
cases, T can be obtained by pruning ℓ in T'.

Consider a tree T and a set $\mathcal{L}' \subseteq \mathcal{L}_T$. The restriction of
T to $\mathcal{L}'$, denoted by $T|_{\mathcal{L}'}$, is the minimal homeomorphic
subtree of T connecting the leaves with labels in $\mathcal{L}'$ (that
is, we start with the minimal subtree of T connecting $\mathcal{L}'$
and repeatedly suppress non-root nodes with at most one
child until no such nodes remain). A tree T' is a subtree
of T if $\mathcal{L}_{T'} \subseteq \mathcal{L}_T$ and $T' = T|_{\mathcal{L}_{T'}}$. Tree T displays T' if T'
is a subtree of T.
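
The restriction and display relations, together with the f-frequency test from the Background, can be sketched in Python as follows. This is an illustration under assumptions that are not taken from the paper: leaves are integer labels, internal nodes are tuples of children, and all function names (`leaves`, `restrict`, `unordered`, `displays`, `is_frequent`) are hypothetical. Isomorphism is tested here by converting trees to nested frozensets, which ignores child order; this is a convenience of the sketch and is not the canonical-form machinery used by MfstMiner.

```python
from fractions import Fraction

def leaves(tree):
    """Set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, int):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def restrict(tree, keep):
    """Restriction of `tree` to the labels in `keep` (None if none lie below)."""
    if isinstance(tree, int):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    # A node left with a single child is suppressed (or deleted, at the root).
    return kids[0] if len(kids) == 1 else tuple(kids)

def unordered(tree):
    """Nested frozensets, so that equality ignores the order of children."""
    if isinstance(tree, int):
        return tree
    return frozenset(unordered(child) for child in tree)

def displays(tree, sub):
    """True if `sub` is a subtree of `tree` in the sense defined above."""
    labels = leaves(sub)
    if not labels <= leaves(tree):
        return False
    return unordered(restrict(tree, labels)) == unordered(sub)

def is_frequent(trees, sub, f):
    """True if at least f * m of the m input trees display `sub`."""
    count = sum(displays(t, sub) for t in trees)
    return count >= f * len(trees)

# Hypothetical input: the candidate ((1, 2), 3) is displayed by two of three trees.
trees = [(((1, 2), 3), 4), (((1, 2), 4), 3), ((1, (2, 3)), 4)]
print(is_frequent(trees, ((1, 2), 3), Fraction(2, 3)))  # frequent for f = 2/3
print(is_frequent(trees, ((1, 2), 3), 1))               # not an agreement subtree
```

Passing f as a `Fraction` avoids floating-point rounding when the count is compared with f · m.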

The depth of a node u in a tree T, denoted $\mathrm{depth}^T(u)$, is
the number of edges from the root to that node; thus the
root node is at depth 0. We denote the lowest common
ancestor (LCA) of two nodes u and v in T by $\mathrm{LCA}^T(u, v)$.
When the tree T is clear from the context, we drop the
superscripts. A k-leaf tree is a tree with k leaves. A triplet
is a 3-leaf tree.

Algorithmic framework 

We first discuss the algorithm for ASTs/MXSTs because it
is simpler (since f = 1). We then extend it to FSTs/MFSTs.

We enumerate all MXSTs from the solution space of all
ASTs. We show that any k-leaf AST can be enumerated
by combining two unique (k - 1)-leaf ASTs with certain
properties. We call the k-leaf tree a join on the two smaller
(k - 1)-leaf trees. To ensure that the enumeration is efficient,
we must address three issues. The first is avoiding
redundancy; that is, each MXST should be generated
only once.

The second is support estimation. While a k-leaf AST is
enumerated by joining two unique (k - 1)-leaf ASTs, the
converse is not true; i.e., these two (k - 1)-leaf ASTs can
potentially combine into more than one topology over k
leaves. The only way to know if a k-leaf AST exists as a
result of joining these two (k - 1)-leaf ASTs is to have a
mechanism to test if only one topology is supported across
all input trees. The third issue is limiting combinatorial
explosion. The total number of ASTs can be exponentially
larger than the total number of MXSTs. Thus, we need a
way to prune the search space of ASTs during enumeration
of MXSTs. We describe how we address these issues
next.

Non-redundant enumeration 

To avoid generating multiple isomorphic copies of the 
same tree, we enumerate subtrees in "canonical form" 
[21] (an ordered representation for phylogenetic trees). To
enumerate every canonical representation once, we define 
a parent-child relationship over the space of all ASTs. 
This induces an enumeration tree over the solution space, 
where each node represents a collection of ASTs grouped 
together via an equivalence relation. Leaf nodes represent 
potential MXSTs and each MXST belongs to a unique 
leaf node. This scheme is motivated by the reverse search 
technique for enumeration [34]. 

Canonical form 

The virtual label of an internal node v is the mini- 
mum label among all leaf descendants of v. Consider a 
left-to-right order on the children of an internal node 
based on the sequence in which they are encountered 
in an inorder depth-first traversal (IDFT), the leftmost 
child being encountered first. Then, a tree T is in canon- 
ical form [21] if, for every internal node, its children 
are ordered from left to right by their virtual labels. It 
can be seen that two trees are isomorphic if and only 
if they have the same canonical forms. By generating all 
trees in canonical form, it is straightforward to test if 
two trees are isomorphic and prevent duplicate enumeration.
MfstMiner relies on this property to ensure that
each FST is enumerated exactly once. Henceforth, we 
assume all trees to be in canonical form unless mentioned 
otherwise. 
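
The canonical form can be illustrated with a short Python sketch. It assumes a representation that is not from the paper (a leaf is an integer label, an internal node is a tuple of children), it assumes that "ordered by virtual label" means ascending order, and the names `virtual_label`, `canonical` and `isomorphic` are hypothetical.

```python
def virtual_label(tree):
    """Minimum leaf label below a node (the label itself for a leaf)."""
    if isinstance(tree, int):
        return tree
    return min(virtual_label(child) for child in tree)

def canonical(tree):
    """Reorder the children of every internal node by ascending virtual label."""
    if isinstance(tree, int):
        return tree
    ordered = sorted((canonical(child) for child in tree), key=virtual_label)
    return tuple(ordered)

def isomorphic(a, b):
    """Two leaf-labeled trees are isomorphic iff their canonical forms coincide."""
    return canonical(a) == canonical(b)

print(canonical((4, (3, (2, 1)))))               # children reordered at every node
print(isomorphic(((1, 2), 3), (3, (2, 1))))      # same tree, different child order
print(isomorphic(((1, 2), 3), (1, (2, 3))))      # different topologies
```

Because sibling subtrees have disjoint leaf sets, their virtual labels are distinct, so the sort never has to break ties and the canonical form is unique.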

Enumeration tree 

The key notion for defining the enumeration tree is that 
of an equivalence class; to explain it, we first need some 
definitions. The rightmost leaf of tree T is the last leaf 
encountered in the IDFT of T. The subtree that results 
from pruning the rightmost leaf is called the prefix tree 
or prefix for short. It is so called because the IDFT of the 
prefix tree is the largest prefix of the IDFT of the original 
tree that is not the original tree. A useful property of the 
canonical form is that pruning either the last or second- 
to-last leaf encountered in the IDFT of a tree results in 
a canonical tree [21]. The heaviest subtree [21] is the 



subtree rooted at the parent of the rightmost leaf. Figure 5 
illustrates the defined concepts. 

An equivalence class is a set E of canonical trees that
share a common prefix. We call this common prefix tree
the core tree of E and denote it by $E_c$. Note that an equivalence
class of k-leaf trees has a (k - 1)-leaf core tree.
For an AST T, $E_T$ denotes the equivalence class that has
T as its core tree. Any two trees in an equivalence class
differ only with respect to their rightmost leaf; therefore,
topologically, their difference is restricted to their heaviest
subtrees. The equivalence relation "sharing a common
prefix" partitions any set of canonical trees into disjoint
subsets. Each such subset is an equivalence class identified
by its unique core tree.
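
As a rough illustration of prefixes and equivalence classes, the Python sketch below prunes the rightmost leaf of a tree and groups trees by the resulting prefix. It assumes, beyond what the paper states, that trees are nested tuples already in canonical form (a leaf is an integer label, an internal node is a tuple of children in left-to-right order); the names `prune_rightmost` and `equivalence_classes` are hypothetical.

```python
from collections import defaultdict

def prune_rightmost(tree):
    """Return (prefix, rightmost_leaf) for a canonical tree with at least two leaves."""
    *rest, last = tree
    if isinstance(last, int):
        # The rightmost leaf hangs directly off this node.
        leaf, kids = last, rest
    else:
        # Otherwise it lies in the last child; prune it there.
        sub, leaf = prune_rightmost(last)
        kids = rest + [sub]
    # A node left with one child is suppressed (or deleted, at the root).
    prefix = kids[0] if len(kids) == 1 else tuple(kids)
    return prefix, leaf

def equivalence_classes(trees):
    """Partition canonical trees by their common prefix (the core tree)."""
    classes = defaultdict(list)
    for t in trees:
        prefix, _ = prune_rightmost(t)
        classes[prefix].append(t)
    return dict(classes)

# Four hypothetical 4-leaf trees; three of them share the core tree ((1, 2), 3).
trees = [(((1, 2), 3), 4), ((1, 2), (3, 4)), ((1, 2), 3, 4), ((1, (2, 3)), 4)]
for core, members in equivalence_classes(trees).items():
    print(core, members)
```

In this example the three trees sharing the core tree ((1, 2), 3) differ only in where leaf 4 is attached along the rightmost path, which matches the observation that members of a class differ only in their heaviest subtrees.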

Figure 6 gives an example of an enumeration tree. Each
node in the enumeration tree represents a unique equivalence
class. An equivalence class E is the parent of equivalence
class F if $F_c \in E$. Clearly each node has a unique
parent. Note that as we traverse from an internal node
towards a leaf, the core tree of each node in the path corresponds
to a new leaf being added as a suffix to the IDFT of
the core tree of its parent node. Thus, if equivalence class
E is an ancestor of equivalence class F, the IDFT of $E_c$ is a
prefix of the IDFT of $F_c$.

A node in the enumeration tree is a leaf if its core
tree is not a prefix to any other AST. Thus, its corresponding
equivalence class is empty. Note that every
MXST is the core tree of some leaf node. The converse
is not true, because a (k - 1)-leaf tree may not
be the prefix of a given k-leaf tree, yet can be a subtree
of it. The root of the enumeration tree is an empty
node that has all the equivalence classes containing 3-leaf
ASTs as its children. This is because three is the
minimum number of leaves on which phylogenetic inference
can be meaningful. For an equivalence class E, the
branch at E represents the subtree induced in the enumeration
tree by all the leaf descendants of E, or simply
E if it is a leaf. ASTs X and Y are considered to be
of a common descent if neither is a descendant of the
other.
Figure 6 Enumeration Tree example. Each node in the tree represents an equivalence class. Trees in an equivalence class differ only with respect
to their rightmost leaves (circled in bold for each tree). The bubble at the top of a node contains the core tree of the corresponding equivalence
class. An equivalence class contains all ASTs that have its core tree as their common prefix. The core tree of an equivalence class belongs to its
parent equivalence class. For example, the core tree of equivalence class B is a, which belongs to A, the parent of B. All 3-leaf ASTs have been
partitioned into equivalence classes A, G and J (children of the root node). The leaf nodes (indicated by shaded ellipses) are empty equivalence
classes and their core trees represent potential MXSTs. Here, d and e, the respective core trees of leaf nodes C and D, are the only MXSTs. They also
happen to be the MASTs for the input trees.



Pairwise join 

The canonical form has the property that pruning either
the last leaf or the second-to-last leaf encountered in the
IDFT yields a subtree that is also canonical [21]. Thus,
every k-leaf AST T corresponds to a unique ordered pair
$(T_x, T_y)$ of (k - 1)-leaf ASTs where $T_x$ and $T_y$ are obtained
by pruning the last leaf and the second-to-last leaf, respectively,
in the IDFT of T. Note that $T_x$ and $T_y$ share a
common prefix. Conversely, T can be obtained by "joining"
this unique pair $(T_x, T_y)$. Based on this, we define tree
T to be a join on an ordered pair $(T_x, T_y)$ of (k - 1)-leaf
ASTs such that $T_x$ and $T_y$ share a common prefix, if:

T is in canonical form, has $T_x$ as its prefix, and
has $T_y$ as its subtree.    (1)



Our scheme exploits condition (1) heavily. Consider
equivalence classes E and F, where E is the parent of F and
E consists of (k - 1)-leaf ASTs. We claim that any k-leaf
tree $T \in F$ is the result of joining two (k - 1)-leaf trees in
E. Specifically, T is the result of joining a unique ordered
pair of trees $(T_x, T_y)$ in E such that condition (1) is satisfied.
Observe that for any ordered pair $(T_x, T_y)$ satisfying
condition (1) with respect to T, $T_x$ is the core tree of F and
belongs to E. Further, $T_x$ and $T_y$ share a common prefix;
thus, $T_y$ also belongs to E. The claim follows.

While every tree in F can be obtained by joining a
unique ordered pair $(T_x, T_y)$ of trees in E, there may
be multiple trees T satisfying condition (1) with respect to
ordered pair $(T_x, T_y)$. The way in which $(T_x, T_y)$ join to
produce T depends on the topology of the subtree displayed
by an input tree over the leaf set $\mathcal{L}_{T_x} \cup \mathcal{L}_{T_y}$. We next
describe the four possible ways in which an ordered pair
$(T_x, T_y)$ can join as per condition (1). In the subsequent
discussion, let x and y denote the rightmost leaves of $T_x$ and
$T_y$ respectively, and let $p_x$ and $p_y$ denote the parents of x and
y respectively. Recall that $E_c$ represents the core tree of
equivalence class E. Note that $E_c$ is also the common prefix
of $T_x$ and $T_y$. Let r denote the rightmost leaf of $E_c$. For
an internal node u, let numChild(u) denote its number of
children. The rightmost path of a tree is the path from the
root to its rightmost leaf. There are three possibilities for the
relative values of $\mathrm{depth}^{T_y}(p_y)$ and $\mathrm{depth}^{T_x}(p_x)$, giving rise
to different types of joins.

If $\mathrm{depth}^{T_y}(p_y) = \mathrm{depth}^{T_x}(p_x)$, the following three types
of joins are possible.

Type 1: Figure 7(a) shows the participating trees. Leaves
x and y are attached at the same depth on the
rightmost path of $E_c$, i.e.,
$\mathrm{depth}^{T_y}(p_y) = \mathrm{depth}^{T_x}(p_x)$. Figure 7(b) shows
the resulting join. Here, x and y are attached as
siblings to the same parent node in the joined
tree. Thus, for the resulting joined tree to be
canonical, we must have $\psi(x) < \psi(y)$ (recall that
we assume that the labels are distinct numbers).
Figure 7 Different types of pairwise join. A dotted triangle represents part of the tree that may be empty, while a solid triangle represents a
non-empty part of the tree. A reflects topologies of the heaviest subtrees; r denotes the rightmost leaf of the common core tree. (a) $T_x$ and $T_y$ in
type-1 and 2 joins. (b) Result of type-1 join. (c) Result of type-2 join. (d) Sample inputs $T_x$ and $T_y$ in type-1 and 2 joins. (e) Result of type-1 join on
sample inputs. (f) Result of type-2 join on sample inputs. (g) $T_x$ and $T_y$ in type-3 join. (h) Result of type-3 join. (i) $T_x$ and $T_y$ in type-4 join. (j) Result of
type-4 join. (k) Sample inputs $T_x$ and $T_y$ in type-3 join. (l) Result of type-3 join on sample inputs. (m) Sample inputs $T_x$ and $T_y$ in type-4 join. (n) Result
of type-4 join on sample inputs.
Further, x and y are attached at the same depth
in the joined tree as in $T_x$ and $T_y$, respectively.
Example: Figure 7(d) shows the input trees and 
Figure 7(e) shows the corresponding joined tree. 

Type 2: The input trees have the same structure as in a 
type 1 join (Figure 7(a)); however, in the joined 
tree x and y are attached to the same parent at 
one level deeper than their respective depths in 
the participating trees. See Figure 7(c). 
Example: Figure 7(d) shows the input trees and 
Figure 7(f) shows the corresponding joined tree. 

Type 3: Figure 7(g) shows the participating trees. Note
that the participating trees are a special case of
type 1 and 2 joins; i.e., $\mathrm{depth}^{T_y}(p_y) = \mathrm{depth}^{T_x}(p_x)$
holds here as well. However, in the resulting join
$p_y$ becomes the parent of $p_x$ as shown in
Figure 7(h). For this to be possible, we must have
numChild($p_y$) = 2 in $T_y$.

Example: Figure 7(k) shows the input trees and
Figure 7(l) shows the corresponding joined tree.

If $\mathrm{depth}^{T_y}(p_y) < \mathrm{depth}^{T_x}(p_x)$, the following join
type arises.

Type 4: Figure 7(i) shows the participating trees. Here
$\mathrm{depth}^{T_y}(p_y) < \mathrm{depth}^{T_x}(p_x)$; i.e., on the rightmost
path of $E_c$, leaf y is attached at a lesser depth
than leaf x. As a result there is only one way to
join $T_x$ and $T_y$ so as to satisfy condition (1). See
Figure 7(j). Here, $p_y$ becomes an ancestor of $p_x$ in
the joined tree.

Example: Figure 7(m) shows the input trees and 
Figure 7(n) shows the corresponding joined tree. 

Finally, note that if $\mathrm{depth}^{T_y}(p_y) > \mathrm{depth}^{T_x}(p_x)$, no joins
are possible: $T_x$ and $T_y$ cannot be joined while satisfying
condition (1), because $T_x$ cannot be the prefix of the
joined tree. ASTs from such joins are enumerated when
considering the ordered pair $(T_y, T_x)$.

The above scheme leads to a natural formulation for
generating all members of the children of E. For every ordered
pair $(T_x, T_y)$ of trees in E such that the pair joins in only one way
in all the trees in the input collection, add the joined tree
to $E_{T_x}$. The ordering indicates that the joined tree has the
first tree of the ordered pair as its prefix.

Support estimation 

An AST is enumerated by combining two smaller ASTs. 
However, an AST can arise out of their combination only 
if the two ASTs exhibit a common type of join (topology) 
in all the input trees. Determining this involves identifying 
the types of joins the smaller ASTs exhibit across the input 
trees, and if a particular join is supported by all the input 
trees. For this we deploy a one-time least common ances- 
tor based preprocessing step, after which the join type in 
each input tree can be identified in constant time. 



Consider an ordered pair of trees $(T_x, T_y)$ in an equivalence
class E and let $\mathcal{L}_u = \mathcal{L}_{T_x} \cup \mathcal{L}_{T_y}$. For an input tree
T, we say the join induced by $(T_x, T_y)$ in T is of type A if
$T|_{\mathcal{L}_u}$ in canonical form corresponds to a type A join with
respect to ordered pair $(T_x, T_y)$. Let $T^{\mathrm{join}}$ denote $T|_{\mathcal{L}_u}$
in canonical form. This step classifies $T^{\mathrm{join}}$ as one of the
four join types. If a particular join is supported by all the
input trees (i.e., f = 1), the corresponding joined tree is
an AST. A natural way to classify the join type could be
to restrict T to $\mathcal{L}_u$, canonicalize the restriction and identify
the canonicalized tree as one of the four join types.
This can be done in time linear in the size of T using
the algorithm given in reference [35]. Theorem 1, stated
next, improves on this by giving a least common ancestor
(LCA) based scheme that in constant time identifies
$T^{\mathrm{join}}$ as a result of one of the four join types. The LCA
values are computed as a preprocessing step. The meaning
of the symbols x, y, $p_x$ and $p_y$ is the same as in Section
"Pairwise join" on page 5. Let r denote the rightmost
leaf of the core tree of E. Superscripts indicate the reference
tree.

Theorem 1. The following holds:

1. $T^{\mathrm{join}}$ is a result of a type-1 join on ordered pair
$(T_x, T_y)$ if and only if

(a) $\mathrm{depth}(\mathrm{LCA}^T(r, x)) = \mathrm{depth}(\mathrm{LCA}^T(r, y))$,

(b) $\mathrm{depth}(\mathrm{LCA}^T(r, x)) = \mathrm{depth}(\mathrm{LCA}^T(x, y))$ and

(c) $\psi(x) < \psi(y)$.

2. $T^{\mathrm{join}}$ is a result of a type-2 join on ordered pair
$(T_x, T_y)$ if and only if

(a) $\mathrm{depth}(\mathrm{LCA}^T(r, x)) = \mathrm{depth}(\mathrm{LCA}^T(r, y))$,

(b) $\mathrm{depth}(\mathrm{LCA}^T(r, x)) < \mathrm{depth}(\mathrm{LCA}^T(x, y))$ and

(c) $\psi(x) < \psi(y)$.

3. $T^{\mathrm{join}}$ is a result of a type-3 join on ordered pair
$(T_x, T_y)$ if and only if

(a) $\mathrm{depth}^{T_x}(p_x) = \mathrm{depth}^{T_y}(p_y)$ and

(b) $\mathrm{depth}(\mathrm{LCA}^T(r, x)) > \mathrm{depth}(\mathrm{LCA}^T(r, y))$.

4. $T^{\mathrm{join}}$ is a result of a type-4 join on ordered pair
$(T_x, T_y)$ if and only if $\mathrm{depth}^{T_x}(p_x) > \mathrm{depth}^{T_y}(p_y)$.

Proof. Let us consider each case separately.

1. Clearly if $T^{\mathrm{join}}$ is a result of a type-1 join, it
satisfies 1a-1c. To prove the converse,
let 1a-1c be satisfied. Since $T_x$ and $T_y$ are obtained
by attaching x and y respectively to the rightmost
path of $E_c$, and each is a subtree of T, 1a implies that
$\mathrm{depth}(p_x) = \mathrm{depth}(p_y)$. Thus, $T^{\mathrm{join}}$ is a result of
either a type 1, 2 or 3 join. A type-3 join requires $p_y$
to be the parent of $p_x$ in $T^{\mathrm{join}}$, which is ruled out
by 1a. Further, 1b and 1c imply that the join must
be of type 1.

2. The proof is similar to that of part 1. Again, $T^{\mathrm{join}}$ can
be a result of either a type-1 or 2 join. Conditions 2b
and 2c imply that the join must be of type 2.

3. Condition 3a implies that $T^{\mathrm{join}}$ must be a result of a
type-1, 2 or 3 join. Condition 3b rules out joins of
type 1 and 2. Thus the join must be of type 3.

4. Follows from the definition of a type-4 join.

□

Cases 1-4 are mutually exclusive and each can be 
evaluated in constant time as follows. 

We first preprocess the input trees to answer each 
LCA-based query in Theorem 1 in constant time (see 
Section "Complexity analysis" on page 13 for more 
details). Using this, instead of identifying the join type in 
all the trees in the input collection individually, we do it 
in constant time across all the input trees at once. That is, 
in the case of ASTs, for any given ordered pair (T x , T y ) we 
answer in constant time if a join of the pair results in an 
AST. 
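
The case analysis of Theorem 1 can be sketched in Python as follows. The sketch makes several assumptions that are not from the paper: an input tree is a nested tuple (a leaf is an integer label, an internal node is a tuple of children), the labels themselves play the role of $\psi$, the depths of $p_x$ in $T_x$ and $p_y$ in $T_y$ are supplied by the caller, and the names `index_tree`, `lca_depth` and `join_type` are hypothetical. The LCA is found here by walking parent pointers, which is a simple stand-in for the constant-time LCA queries obtained by preprocessing.

```python
def index_tree(tree):
    """Number the nodes of a nested-tuple tree; return (parent, depth, leaf_id) maps."""
    parent, depth, leaf_id = {}, {}, {}
    counter = 0

    def visit(node, par, d):
        nonlocal counter
        nid = counter
        counter += 1
        parent[nid], depth[nid] = par, d
        if isinstance(node, int):
            leaf_id[node] = nid
        else:
            for child in node:
                visit(child, nid, d + 1)

    visit(tree, None, 0)
    return parent, depth, leaf_id

def lca_depth(index, a, b):
    """Depth of the LCA of leaves a and b (naive walk up the parent pointers)."""
    parent, depth, leaf_id = index
    u, v = leaf_id[a], leaf_id[b]
    while u != v:
        # Always move the deeper of the two nodes one step towards the root.
        if depth[u] >= depth[v]:
            u = parent[u]
        else:
            v = parent[v]
    return depth[u]

def join_type(index, r, x, y, depth_px, depth_py):
    """Join type (1-4) induced in the indexed tree, or None if no case applies."""
    if depth_px > depth_py:
        return 4                      # case 4 of Theorem 1
    rx = lca_depth(index, r, x)
    ry = lca_depth(index, r, y)
    xy = lca_depth(index, x, y)
    if depth_px == depth_py and rx > ry:
        return 3                      # case 3
    if rx == ry and x < y:
        if rx == xy:
            return 1                  # case 1
        if rx < xy:
            return 2                  # case 2
    return None                       # left to the ordered pair (T_y, T_x)

# Hypothetical pair T_x = ((1, 2), 3), T_y = ((1, 2), 4) with core tree (1, 2), r = 2.
for t in [((1, 2), 3, 4), ((1, 2), (3, 4)), (((1, 2), 3), 4)]:
    print(t, join_type(index_tree(t), 2, 3, 4, 0, 0))
```

Only depth comparisons between LCAs are needed, so the extra leaves of a larger input tree do not affect the classification.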

Containing the combinatorial explosion 

Although, in principle, we could enumerate all MXSTs 
by traversing the complete enumeration tree of ASTs, 
the sheer number of ASTs makes this approach, used by 
itself, impractical. To mitigate the impact of combinatorial 
explosion, we use a heuristic that, given a node in the enu- 
meration tree, determines, without traversing the subtree 
below the node, whether any of its leaf descendants contains
an MXST. If none of them does, we prune the branch
at the node. 

Pruning heuristic 

Let X and Y be two equivalence classes. We say that X
prunes the branch of the enumeration tree at Y, or simply
that X prunes Y, if X and Y are of a common descent,
and for every descendant A of Y (including Y), there exists
at least one descendant B of X (which can be X itself)
such that $B_c$ displays $A_c$. If X prunes Y, none of the leaf
descendants of Y can be an MXST. Thus, the branch at Y
need not be enumerated. For example, in Figure 6, node
A prunes node G because G has 3 descendants (itself,
H and I) whose respective core trees are displayed by
the respective core trees of nodes B, C and D, which are
descendants of A. If this information is known when G is
first visited, the branch at G can be pruned. Similarly, A
also prunes J because the respective core trees of nodes J,
K and L are displayed by the respective core trees of nodes
B, C and D. Further, among the descendants of A, nodes E
and F are respectively pruned by nodes C and D.



For the next set of results, let E denote an equivalence
class with r as the rightmost leaf of its core tree. Let
X, Y, Z be children of E in the enumeration tree. Let x, y, z
be the rightmost leaves of $X_c$, $Y_c$, $Z_c$ respectively. Clearly
$\{X_c, Y_c, Z_c\} \subseteq E$. We say [i, j, k] is an agreement triplet if
all the input trees display the same topology over the leaf
set {i, j, k}. For an ordered pair of trees (A, B) in an equivalence
class, having a and b as their respective rightmost
leaves, we say tree $T_{ab}$ exists if (A, B) join as $T_{ab}$ across
all the input trees. Let I, J and K be three trees belonging
to a common equivalence class; let i, j and k be their
respective rightmost leaves. Suppose that and exist. 
If the ordered pair (Ty, T%) exhibits a common join across 
all the input trees, we denote the join as T^. Theorem 2 
characterizes pruning among "siblings" and "first cousins" 
in the enumeration tree. 
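The agreement-triplet test can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes trees are given as nested Python tuples with string leaf labels, and the function names (`leaf_paths`, `triplet_topology`, `is_agreement_triplet`) are hypothetical.

```python
def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path of child indices."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for idx, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (idx,)))
    return paths


def lca_depth(p, q):
    """Depth of the LCA of two leaves = length of their common path prefix."""
    depth = 0
    for u, v in zip(p, q):
        if u != v:
            break
        depth += 1
    return depth


def triplet_topology(tree, a, b, c):
    """Return 'star' for an unresolved triplet, else the pair forming the cherry."""
    paths = leaf_paths(tree)
    d_ab = lca_depth(paths[a], paths[b])
    d_ac = lca_depth(paths[a], paths[c])
    d_bc = lca_depth(paths[b], paths[c])
    if d_ab == d_ac == d_bc:
        return "star"
    deepest = max(d_ab, d_ac, d_bc)
    if d_ab == deepest:
        return frozenset((a, b))
    if d_ac == deepest:
        return frozenset((a, c))
    return frozenset((b, c))


def is_agreement_triplet(trees, a, b, c):
    """True if every input tree displays the same topology on {a, b, c}."""
    return len({triplet_topology(t, a, b, c) for t in trees}) == 1
```

For instance, with `t1 = ("a", ("b", "c"), "d")` and `t2 = (("a", "d"), ("b", "c"))`, the triplet on {a, b, c} is an agreement triplet (both trees group b with c), while the triplet on {a, b, d} is not.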

Theorem 2.

1. X prunes Y if either of the following holds:

(a) $T_{xy}$ exists and is not of join type 2.

(b) $T_{xy}$ exists as a join of type 2 and for every child Z of E such that $T_{yz}$ exists, [x, y, z] is an agreement triplet.

2. If $T_{xy}$ and $T_{yz}$ exist, and $T_{xy}$ is of join type 2, then $\mathcal{E}_{T_{xy}}$ prunes $\mathcal{E}_{T_{yz}}$ if $T_{yz}$ is not of join type 2.

Part 1 of Theorem 2 deals with pruning among siblings; 
part 2 deals with pruning among first cousins. The proof 
of Theorem 2 relies on Lemmas 1, 2, 3 and 4, which we 
present next. 

Lemma 1. Suppose that $T_{xy}$ and $T_{yz}$ exist. Then, the following holds:

1. If $T_{xy}$ is not a result of a type-2 join:

(a) $T_{xz}$ exists and is not a result of a type-2 join.

(b) There exists an AST T on leaf set $\mathcal{L}_{E^c} \cup \{x, y, z\}$ and $\mathcal{E}_T$ is a descendant of X.

2. If both $T_{xy}$ and $T_{yz}$ are results of join type 2, and [x, y, z] is an agreement triplet:

(a) $T_{xz}$ exists.

(b) There exists an AST T on leaf set $\mathcal{L}_{E^c} \cup \{x, y, z\}$ and $\mathcal{E}_T$ is a descendant of X.

3. If $T_{xy}$ is a result of a type-2 join and $T_{yz}$ is not a result of a type-2 join:

(a) $T_{xz}$ exists and is not a result of a type-2 join.

(b) There exists an AST T on leaf set $\mathcal{L}_{E^c} \cup \{x, y, z\}$ and $\mathcal{E}_T$ is a descendant of $\mathcal{E}_{T_{xy}}$, i.e., $T_{xy-z}$ exists.

Lemma 1 gives conditions under which, for a given child $\mathcal{E}_{T_{yz}}$ of Y, X has a descendant (specifically a grandchild) whose core tree displays $T_{yz}$. Intuitively, this is an intermediate step in proving conditions under which X prunes either Y or $\mathcal{E}_{T_{yz}}$. The specific results of Lemma 1 help in cascading the effect to further descendants of Y or $\mathcal{E}_{T_{yz}}$. Each of $T_{xy}$ and $T_{yz}$ can be the result of one of the four types of join. Thus, considering $T_{xy}$ and $T_{yz}$ together, there are 16 possibilities. Of these, the case where both $T_{xy}$ and $T_{yz}$ result from a type-2 join is the only one where there can be multiple topologies that display both $T_{xy}$ and $T_{yz}$. That is why type-2 joins have a special significance in Lemma 1. The remaining 15 cases guarantee the existence of a single topology that displays both $T_{xy}$ and $T_{yz}$.

Lemma 2. If $T_{xy}$ and $T_{xz}$ exist, and [x, y, z] is an agreement triplet, either $T_{x-yz}$ or $T_{x-zy}$ exists.

Lemma 2 states that for any two children of X with y 
and z as the rightmost leaves of their respective core trees, 
the existence of agreement triplet [x,y,z] is a sufficient 
condition for the children to join in one of the two pos- 
sible ways. That is, there exists only one topology that 
displays both the children; further, the topology must be a 
descendant of X. 

Lemma 3. Suppose A and B are two ASTs with a and b as their respective rightmost leaves such that $\mathcal{L}_A \subset \mathcal{L}_B$. Then, $\mathcal{E}_B$ is a descendant of $\mathcal{E}_A$ if and only if for every $i \in \mathcal{L}_B \setminus \mathcal{L}_A$, $T_{ai}$ exists.

Lemma 4. Suppose $T_{xy}$ exists as a result of a type-2 join and D is a descendant of Y. Then, for any pair $\{a, b\} \subseteq \mathcal{L}_{D^c} \setminus \mathcal{L}_{Y^c}$, [x, a, b] is an agreement triplet if [x, y, a] and [x, y, b] are agreement triplets.

Lemma 4 deals with the special case of type-2 joins for 
T xy . Intuitively, it states that for any two children of Y 
with a and b as the rightmost leaves of their respective 
core trees, agreement triplets [x,y, a] and [x,y, b] together 
form a sufficient condition for the existence of a descen- 
dant of X whose core tree displays the core trees of both 
children of Y. This is a stepping stone in proving that 
for every descendant D of Y, there exists a descendant 
of X whose core tree displays the core tree of D, thus, X 
prunes Y. 

The proofs of the above lemmas are given in the Appendix. We present next the proof of Theorem 2. In the proof, and in the rest of the paper, we represent trees in parenthesized Newick format (http://evolution.genetics.washington.edu/phylip/newicktree.html). E.g., (a, (b, c)) represents the tree with leaf set {a, b, c} where the LCA of b and c is a proper descendant of the LCA of a, b and c, and (a, b, c) represents the unresolved (star) tree on {a, b, c}.



Proof of Theorem 2. 

1. (a) Consider any descendant S of Y. Consider any pair $\{a, b\} \subseteq \mathcal{L}_{S^c} \setminus \mathcal{L}_{Y^c}$. Since S is a descendant of Y, $T_{ya} \in Y$ and $T_{yb} \in Y$ exist as per Lemma 3. Further, since $T_{xy}$ is not of join type 2, as per Lemma 1.1: (i) $T_{xa} \in X$ and $T_{xb} \in X$ exist, and each is also not a result of a type-2 join, and (ii) there exist ASTs on leaf sets $\mathcal{L}_{E^c} \cup \{x, y, a\}$ and $\mathcal{L}_{E^c} \cup \{x, y, b\}$; thus, [x, y, a] and [x, y, b] are agreement triplets. Let A and B denote the children of E with a and b as their respective rightmost leaves. Since AST S exists, there exists an AST on leaf set $\mathcal{L}_{E^c} \cup \{a, b\}$, i.e., either $T_{ab} \in A$ or $T_{ba} \in B$ exists. Without loss of generality, let $T_{ab}$ exist. Since $T_{xa}$ exists and is not a result of a type-2 join, and $T_{ab}$ exists, as per Lemma 1.1, there exists an AST on leaf set $\mathcal{L}_{E^c} \cup \{x, a, b\}$. Thus:

• for every leaf $i \in \mathcal{L}_{S^c} \setminus \mathcal{L}_{X^c}$, $T_{xi}$ exists; thus, for every pair of leaves $\{x_1, x_2\}$ in $X^c$, $[x_1, x_2, i]$ is an agreement triplet.

• for every pair of leaves $\{a, b\} \subseteq \mathcal{L}_{S^c} \setminus \mathcal{L}_{X^c}$, there exists an AST on leaf set $\mathcal{L}_{E^c} \cup \{x, a, b\}$; thus, for every leaf $\ell$ in $X^c$, $[\ell, a, b]$ is an agreement triplet.

Thus, there exists an AST T on leaf set $\mathcal{L}_{S^c} \cup \mathcal{L}_{X^c}$. Clearly T displays $S^c$. Further, for every $i \in \mathcal{L}_{S^c} \setminus \mathcal{L}_{X^c}$, $T_{xi}$ exists. Thus, by Lemma 3, T is a descendant of X. Hence, X prunes Y, as claimed.
(b) Consider any descendant S of Y. Consider any pair $\{a, b\} \subseteq \mathcal{L}_{S^c} \setminus \mathcal{L}_{Y^c}$. Since S is a descendant of Y, by Lemma 3, $T_{ya}$ and $T_{yb}$ exist. Thus, as per the if condition of the claim to be proved, [x, y, a] and [x, y, b] are agreement triplets. Since $T_{xy}$ exists as a result of a type-2 join and [x, y, a] and [x, y, b] are agreement triplets, by Lemma 4, [x, a, b] is an agreement triplet, and, by Lemma 1 (parts 2 and 3), $T_{xa}$ and $T_{xb}$ exist. Thus, as per Lemma 2, there exists an AST on leaf set $\mathcal{L}_{E^c} \cup \{x, a, b\}$. Thus:

• for every leaf $i \in \mathcal{L}_{S^c} \setminus \mathcal{L}_{X^c}$, $T_{xi}$ exists; thus, for every pair of leaves $\{x_1, x_2\}$ in $X^c$, $[x_1, x_2, i]$ is an agreement triplet.

• for every pair of leaves $\{a, b\} \subseteq \mathcal{L}_{S^c} \setminus \mathcal{L}_{X^c}$ and for every leaf $\ell$ in $X^c$, there exists an AST on leaf set $\mathcal{L}_{E^c} \cup \{x, a, b\}$; thus, $[\ell, a, b]$ is an agreement triplet.

Thus, there exists an AST T on leaf set $\mathcal{L}_{S^c} \cup \mathcal{L}_{X^c}$. Clearly T displays $S^c$. Further, for every $i \in \mathcal{L}_{S^c} \setminus \mathcal{L}_{X^c}$, $T_{xi}$ exists; thus, by Lemma 3, T is a descendant of X. Hence, X prunes Y, as claimed.
2. Since $T_{yz}$ is not a result of a type-2 join, by Lemma 1.3, $T_{xz}$ exists and is not a result of a type-2 join, and $T_{xy-z}$ exists. Consider any descendant S of $\mathcal{E}_{T_{yz}}$. Consider any $i \in \mathcal{L}_{S^c} \setminus \mathcal{L}_{T_{yz}}$. Since S is a descendant of Y, by Lemma 3, $T_{yi}$ exists. We show that $T_{yi}$ is not a result of a type-2 join by contradiction. Let $T_{yi}$ be a result of a type-2 join with triplet [r, y, i] of type (r, (y, i)). Since S is a descendant of $\mathcal{E}_{T_{yz}}$, by Lemma 3, the join $T_{y-zi} \in \mathcal{E}_{T_{yz}}$ on ordered pair $(T_{yz}, T_{yi})$ exists. Let $T_{yz}$ be of type 1 join with triplet [r, y, z] of type (r, y, z). Since $T_{y-zi}$ displays both (r, (y, i)) and (r, y, z), it must also display (r, (y, i), z). However, (r, (y, i), z) cannot exist in $T_{y-zi}$ because i and z cannot be the last and second-to-last leaves respectively in the IDFT. Similarly, for $T_{yz}$ of join type 3 or type 4, we can show that i cannot be the rightmost leaf in $T_{y-zi}$. Thus, $T_{yi}$ is not of join type 2. Thus, by Lemma 1.3, $T_{xi}$ exists and is not a result of a type-2 join, and $T_{xy-i}$ exists. Consider any pair $\{a, b\} \subseteq \mathcal{L}_{S^c} \setminus \mathcal{L}_{T_{xy}}$. Thus, both $T_{xa}$ and $T_{xb}$ exist and each is not a result of join type 2. Let A and B denote the ASTs in E with a and b as their rightmost leaves. Without loss of generality, let the AST on leaf set $\mathcal{L}_{E^c} \cup \{a, b\}$ be a result of a join on the ordered pair $(A^c, B^c)$, i.e., $T_{ab}$ exists. Thus, by Lemma 1.1, there exists an AST on leaf set $\mathcal{L}_{E^c} \cup \{x, a, b\}$. Thus:

• for every $i \in \mathcal{L}_{S^c} \setminus \mathcal{L}_{T_{xy}}$, $T_{xy-i}$ exists; thus, for every pair of leaves $\{x_1, x_2\}$ in $T_{xy}$, $[x_1, x_2, i]$ is an agreement triplet.

• for every pair $\{a, b\} \subseteq \mathcal{L}_{S^c} \setminus \mathcal{L}_{T_{xy}}$ and for every leaf $\ell$ in $T_{xy}$, there exists an AST on leaf set $\mathcal{L}_{E^c} \cup \{x, a, b\}$; thus, $[\ell, a, b]$ is an agreement triplet.

Thus, there exists an AST T on leaf set $\mathcal{L}_{S^c} \cup \mathcal{L}_{T_{xy}}$. Clearly T displays $S^c$. Further, since for every $i \in \mathcal{L}_{S^c} \setminus \mathcal{L}_{T_{xy}}$, $T_{xy-i}$ exists, by Lemma 3, T is a descendant of $\mathcal{E}_{T_{xy}}$. Hence, $\mathcal{E}_{T_{xy}}$ prunes $\mathcal{E}_{T_{yz}}$, as claimed.

□ 

Pruner-list. Note that the conditions in parts (1a) and (2) of Theorem 2 can be evaluated in constant time, while testing the condition in part (1b) of Theorem 2 takes time linear in the size of the tree being pruned. Neither of these requires enumerating the pruned branch at Y. However, there are cases where Y or a descendant of Y is pruned by X but this cannot be identified using Theorem 2. In such cases, the branch at Y must be enumerated and potential pruning by X verified. For this, we maintain a "pruner-list" for every child of Y. We explain this idea next.

Let X, Y, Z be children of an equivalence class E in the enumeration tree. Let x, y, z be the rightmost leaves of $X^c$, $Y^c$, $Z^c$ respectively. Let joins $T_{xy}$, $T_{yz}$ exist such that none of the cases of Theorem 2 hold. Then the pruner-list of $\mathcal{E}_{T_{yz}}$:

• contains x if [x, y, z] is an agreement triplet. (2)

• inherits members from the intersection of the pruner-lists of $\mathcal{E}_{T_y}$ and $\mathcal{E}_{T_z}$. (3)

Now, using arguments similar to the proof of 
Theorem 2, we can show that: 

Theorem 3. Y is pruned by an equivalence class A with a as the rightmost leaf of $A^c$ if either of the following holds:

1. Y has no children and has a in its pruner-list.

2. All children of Y have a in their pruner-list.
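Conditions (2) and (3) and the test of Theorem 3 can be sketched as below for the MXST case. This is only an illustration under stated assumptions: `EqClass` is a hypothetical stand-in for an equivalence class, pruner-lists are modeled as Python sets of leaf labels, and the agreement-triplet predicate is passed in by the caller, who is also assumed to have already checked that none of the cases of Theorem 2 applies.

```python
from dataclasses import dataclass, field


@dataclass
class EqClass:
    """Hypothetical stand-in for an equivalence class in the enumeration tree."""
    rightmost: str                                   # rightmost leaf of the core tree
    pruner_list: set = field(default_factory=set)    # labels of candidate pruners
    children: list = field(default_factory=list)     # child equivalence classes


def build_pruner_list(x, y, z, pruners_y, pruners_z, is_agreement_triplet):
    """Pruner-list of the class of T_yz, following conditions (2) and (3)."""
    pruners = set(pruners_y) & set(pruners_z)   # condition (3): inherit the intersection
    if is_agreement_triplet(x, y, z):           # condition (2): add x
        pruners.add(x)
    return pruners


def is_pruned_by(node, a):
    """Theorem 3: node is pruned by the class whose core tree has rightmost leaf a."""
    if not node.children:
        return a in node.pruner_list
    return all(a in child.pruner_list for child in node.children)
```

Sets are used here only for readability; any structure supporting intersection and membership tests would serve.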

MfstMiner 

Algorithm 1 is a high-level description of MfstMiner for the special case of enumerating all MXSTs for a collection C of input trees. MfstMiner first invokes enumerateAST_Triplets (whose details are omitted) to enumerate all AST triplets and to partition them into equivalence classes. The set of all such classes is denoted by $EC_3$. Note that each equivalence class in $EC_3$ is a child of the root of the enumeration tree. After this, MfstMiner invokes subroutine enumerateNode, explained next, to enumerate the elements of the branch at each equivalence class in $EC_3$.



Algorithm 1 Enumerating all MXSTs: a high-level description

MfstMiner(C)
    $EC_3$ ← enumerateAST_Triplets(C)
    for all $E \in EC_3$ do
        enumerateNode(E)
    end for



Algorithm 2 shows the details of enumerateNode, which accepts an equivalence class E as input and enumerates the branch at E in the enumeration tree. In the pseudocode, $T_x$ denotes an AST that has x as its rightmost leaf. Lines 3-5 perform pairwise joining among members of an equivalence class in the enumeration tree. Comments in braces indicate where the algorithm performs assignment to pruner-lists, as per conditions (2) and (3), and pruning, according to the different cases of Theorems 2 and 3. In line 12, the core tree of an empty equivalence class (representing a leaf node) is produced as output if it is found to be not pruned.



Algorithm 2 Enumerating the branch at E in the enumeration tree

enumerateNode(E)
 1: for all $T_x \in E$ do
 2:     if $\mathcal{E}_{T_x}$ is not flagged as pruned then {As per Theorem 2(1a)}
 3:         for all $T_y \in \{E - T_x\}$ such that join $T_{xy}$ exists do
 4:             Flag $\mathcal{E}_{T_y}$ if pruned by $\mathcal{E}_{T_x}$ {As per Theorem 2(1a)}
 5:             Add $T_{xy}$ to $\mathcal{E}_{T_x}$
 6:             Initialize pruner-list of $\mathcal{E}_{T_{xy}}$ {As per condition (3)}
 7:         end for
 8:     end if
 9: end for
10: for all $T_x \in E$ do
11:     if $\mathcal{E}_{T_x}$ has no children then
12:         Output $T_x$ if $\mathcal{E}_{T_x}$ is not flagged as pruned and its pruner-list is empty {As per Theorems 2(1a) and 3(1)}
13:     else
14:         Remove trees from $\mathcal{E}_{T_x}$ that can be pruned {As per Theorem 2(2)}
15:         For every $T \in \mathcal{E}_{T_x}$, update pruner-list of $\mathcal{E}_T$ {As per condition (2)}
16:         if $\mathcal{E}_{T_x}$ cannot be pruned then {As per Theorems 2(1b) and 3(2)}
17:             enumerateNode($\mathcal{E}_{T_x}$)
18:         end if
19:     end if
20: end for



In line 17, enumerateNode calls itself recursively to enumerate the children of a non-empty equivalence class $\mathcal{E}_{T_x}$, provided $\mathcal{E}_{T_x}$ is found to be not pruned.

The general case of enumerating MFSTs 

We now explain how MfstMiner handles the general case of mining MFSTs. The main difference between mining for MXSTs and mining for MFSTs is that the former is (as we have seen) based on enumerating ASTs, while the latter is based on enumerating FSTs. While an AST must be supported by all the input trees (i.e., f = 1), an FST need only be supported by some fraction $f \in (\tfrac{1}{2}, 1]$ of the input trees. This difference affects neither the enumeration tree nor the pairwise join, but it does affect support estimation and the pruning strategy. We discuss these steps next.

Support estimation. Given $T_x$ and $T_y$ in an equivalence class, a join $T_{xy}$ on ordered pair $(T_x, T_y)$ is an FST if it is supported by at least a fraction f of the input trees (i.e., if at least a fraction f of the input trees have $T_{xy}$ as a subtree). Note that any such $T_{xy}$ is supported only by those trees that support both $T_x$ and $T_y$ as well. Motivated by this, for each FST $T_x$ we maintain a support list, denoted by $T_x$.supList, that contains all trees in the input collection that support $T_x$. To estimate whether the join on $(T_x, T_y)$ results in an FST, we apply Theorem 1 only on trees in $T_x.\text{supList} \cap T_y.\text{supList}$. We store the support list as a bitmap [36] for efficient memory utilization and fast computation of intersections of support lists.
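The bitmap idea can be illustrated with a short sketch. It assumes, purely for illustration, that a support list is a Python integer whose i-th bit is set when the i-th input tree supports the subtree; the helper names are hypothetical and this is not the authors' C++ code.

```python
def support_bitmap(supporting_indices):
    """Encode a support list as an integer bitmap (bit i set = tree i supports)."""
    bits = 0
    for i in supporting_indices:
        bits |= 1 << i
    return bits


def support_count(bits):
    """Number of input trees recorded in a support bitmap."""
    return bin(bits).count("1")


def candidate_trees(sup_x, sup_y):
    """Trees that support both T_x and T_y: only these can support the join T_xy."""
    return sup_x & sup_y


def join_may_be_frequent(sup_x, sup_y, n_trees, f):
    """Necessary condition for T_xy to be an FST: the intersection is large enough.

    The join type must still be checked on each tree in the intersection.
    """
    return support_count(candidate_trees(sup_x, sup_y)) >= f * n_trees


def is_subset(sup_a, sup_b):
    """True if every tree in support list a also appears in support list b."""
    return sup_a & ~sup_b == 0
```

The intersection is a single bitwise AND per machine word, which is why the bitmap representation makes both the support test and the subset tests on support lists cheap.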

Pruning strategy. To verify whether an equivalence class X prunes an equivalence class Y, we also need to consider the support lists of $X^c$ and $Y^c$. We say [x, y, z] is a frequent triplet if at least a fraction f of the input trees display the same triplet over the leaf set {x, y, z}. Let [x, y, z].supList denote the support list of such a frequent triplet. Based on this, we can restate Theorem 2 for the case of enumerating MFSTs as follows.

Theorem 4.

1. X prunes Y if either

(a) $T_{xy}$ exists, $Y^c.\text{supList} \subseteq X^c.\text{supList}$ and $T_{xy}$ is not of join type 2, or

(b) $T_{xy}$ exists as a type-2 join, $Y^c.\text{supList} \subseteq X^c.\text{supList}$ and for every $Z \in E$ such that $T_{yz}$ exists, [x, y, z] is a frequent triplet with $Y^c.\text{supList} \subseteq [x, y, z].\text{supList}$.

2. If $T_{xy}$ and $T_{yz}$ exist and $T_{xy}$ is of join type 2, then $\mathcal{E}_{T_{xy}}$ prunes $\mathcal{E}_{T_{yz}}$ if $T_{yz}$ is not of join type 2 and $T_{yz}.\text{supList} \subseteq T_{xy}.\text{supList}$.

Pruner-list. Pruning cases not identified by Theorem 4 require the use of a pruner-list. In the case of MFSTs, along with the leaf label, the pruner-list also contains the support list of the core tree of the equivalence class that is claiming to prune. To explain further, let $T_x$, $T_y$, $T_z$ be FSTs in an equivalence class E with x, y, z as their respective rightmost leaves, such that joins $T_{xy}$ and $T_{yz}$ exist, and none of the cases of Theorem 4 hold. For an equivalence class A, let A.prunerList denote its pruner-list. Then, the next set of conditions describes pruner-lists for enumerating MFSTs.

1. $\mathcal{E}_{T_{yz}}$.prunerList contains the entry $(x, T_{xy}.\text{supList})$ if $T_{xy}$ is not of join type 2 and $|T_{xy}.\text{supList} \cap T_{yz}.\text{supList}| \ge f$.

2. $\mathcal{E}_{T_{yz}}$.prunerList contains the entry $(x, S_\cap = T_{xy}.\text{supList} \cap [x, y, z].\text{supList})$ if $T_{xy}$ is of join type 2, [x, y, z] is a frequent triplet and $|S_\cap \cap T_{yz}.\text{supList}| \ge f$.

3. For every leaf label w such that $(w, S_\cap^y) \in \mathcal{E}_{T_y}$.prunerList and $(w, S_\cap^z) \in \mathcal{E}_{T_z}$.prunerList exist, $\mathcal{E}_{T_{yz}}$.prunerList contains the entry $(w, S_\cap = S_\cap^y \cap S_\cap^z)$ if $|S_\cap \cap T_{yz}.\text{supList}| \ge f$.

Conditions 1 and 2 describe the addition of new labels to the pruner-list of $\mathcal{E}_{T_{yz}}$, while condition 3 describes the inheritance of labels from the intersection of the pruner-lists of $\mathcal{E}_{T_y}$ and $\mathcal{E}_{T_z}$. Now, the corresponding result for Theorem 3 can be stated as:

Theorem 5. Given an equivalence class A with a as the rightmost leaf of $A^c$,

1. $\mathcal{E}_{T_{yz}}$ is pruned by A if (i) for every $T_{yb} \in \mathcal{E}_{T_y}$, $\mathcal{E}_{T_{yb}}$.prunerList contains an entry $(a, S_\cap^{(a)})$, and (ii) for $S_\cap = \bigcap_{T_{yb} \in \mathcal{E}_{T_y}} S_\cap^{(a)}$, $\mathcal{E}_{T_{yz}}.\text{supList} = S_\cap$.

2. $\mathcal{E}_{T_y}$ is pruned by A if (i) $\mathcal{E}_{T_y}$ is empty and has a in its pruner-list, or (ii) every child $\mathcal{E}_{T_{yb}}$ of $\mathcal{E}_{T_y}$ is pruned by A as per part 1 of this theorem.

This completes the description of MfstMiner for the general case of mining MFSTs. The overall framework is the same as in the special case of mining all MXSTs. The difference lies in the finer details of incorporating support lists in the support estimation and the pruning step. These details were discussed in this section and are easy to incorporate in Algorithms 1 and 2.

Complexity analysis 

Here we discuss the runtime complexity of the preprocessing step, the data structures used in the algorithm implementation and the memory requirements in each step of the MfstMiner algorithm. In the following discussion, let the input collection consist of n trees on a common leaf set $\mathcal{L}$ and let f denote the input fraction for computing f-frequent subtrees.

Preprocessing. This one-time task involves (1) computing LCA mappings for all pairs of leaves for all the input trees and (2) enumerating all frequent triplets. For each tree T in the input collection, and every pair {u, v} of leaves of T, step (1) computes $\mathrm{LCA}_T(u, v)$ and stores it in a three-dimensional array indexed by the triplet (T, u, v), which requires $O(n|\mathcal{L}|^2)$ space. In our implementation, these LCA values are computed in quadratic time and space per tree by traversing the tree in a depth-first manner and computing the LCA values of the leaf descendants at each node. Thus, for all the input trees, step (1) takes $O(n|\mathcal{L}|^2)$ time and space. In the case of MXSTs, only a two-dimensional array is used (requiring $O(|\mathcal{L}|^2)$ space) to store the LCA values, because an LCA value is relevant only if it is the same across all input trees, i.e., f = 1. We should point out that it is well known that one can preprocess a tree in linear time and space to produce a data structure that can answer any LCA query on that tree in constant time [37-39]. Such algorithms are quite useful when the number of LCA queries is limited and the preprocessing dominates the total time. That is not the case in our application. Indeed, MfstMiner queries all possible LCA values while enumerating all MFSTs on three leaves, and then does a constant number of LCA queries for every join operation thereafter. Although both our three-dimensional array and the specialized LCA data structures [37-39] offer constant-time access to LCA values, the former's constant factor is smaller than the latter's, which makes a significant difference in practice.
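A depth-first, quadratic-time computation of all pairwise LCA values for one tree might look like the sketch below. It assumes nested-tuple trees with string leaf labels and uses a dictionary in place of the two-dimensional array; node identifiers are preorder numbers chosen for illustration, and the function name `lca_table` is hypothetical.

```python
def lca_table(tree):
    """Return {(u, v): lca_id} for every ordered pair of distinct leaves.

    Each pair is assigned exactly once, at the node where the two leaves
    first meet, so the total work is quadratic in the number of leaves.
    """
    table = {}
    next_id = [0]

    def visit(node):
        node_id = next_id[0]
        next_id[0] += 1
        if not isinstance(node, tuple):
            return [node]                      # a single leaf
        seen = []                              # leaves under earlier children
        for child in node:
            below = visit(child)
            for u in seen:                     # leaves in different child subtrees
                for v in below:                # meet exactly at this node
                    table[(u, v)] = node_id
                    table[(v, u)] = node_id
            seen.extend(below)
        return seen

    visit(tree)
    return table


# One table per input tree plays the role of the array indexed by (T, u, v).
tables = [lca_table(t) for t in [(("a", "b"), ("c", ("d", "e")))]]
```

A dictionary lookup stands in for the array access here; a real implementation would map leaf labels to integer indices and use contiguous storage.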

Step (2) takes $O(n|\mathcal{L}|^3)$ time and space, and uses the precomputed LCA values from step (1). In the case of MXSTs, just as storing the LCA values requires less space ($O(|\mathcal{L}|^2)$), the complexity of step (2) is also reduced to $O(|\mathcal{L}|^3)$ time and space.

Join-operation. Every join operation requires a constant number of LCA queries and depths of certain nodes (see Theorem 1). However, note that the relative depths are needed rather than absolute values. In most of the cases of Theorem 1 the relative depths can be obtained by examining the type of tree a certain triplet displays. That is, Theorem 1 can be restated as:

1. $T^{\text{join}}$ is a result of a type-1 join on ordered pair $(T_x, T_y)$ if and only if

(a) the triplet on leaves {r, x, y} is of type (r, x, y) and

(b) $\psi(x) < \psi(y)$.

2. $T^{\text{join}}$ is a result of a type-2 join on ordered pair $(T_x, T_y)$ if and only if

(a) the triplet on leaves {r, x, y} is of type (r, (x, y)) and

(b) $\psi(x) < \psi(y)$.

3. $T^{\text{join}}$ is a result of a type-3 join on ordered pair $(T_x, T_y)$ if and only if

(a) $\text{depth}_{T_x}(p_x) = \text{depth}_{T_y}(p_y)$ and

(b) the triplet on leaves {r, x, y} is of type ((r, x), y).

4. $T^{\text{join}}$ is a result of a type-4 join on ordered pair $(T_x, T_y)$ if and only if

(a) $\text{depth}_{T_x}(p_x) > \text{depth}_{T_y}(p_y)$ and

(b) the triplet on leaves {r, x, y} is of type ((r, x), y).

The advantage of reformulating Theorem 1 as above is 
that (a) for every tree in the input collection, we need not 
store the depth of its nodes, and (b) for every FST enumer- 
ated as a result of a join operation, we only need to store 
the depth of the parent of its rightmost leaf. Further, the 
type of tree a certain triplet displays can be easily known 
through a constant number of queries on precomputed 
LCA values. 
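To illustrate how a triplet type follows from a constant number of LCA lookups, the sketch below classifies the triplet on {r, x, y} using only equality tests on three LCA values. It relies on the standard fact that, in a rooted tree, at least two of the three pairwise LCAs coincide, and the pair whose LCA differs forms the cherry. The `lca` callable is an assumed interface standing in for the precomputed table.

```python
def triplet_type(lca, r, x, y):
    """Classify the rooted triplet on {r, x, y} from three LCA lookups.

    `lca(u, v)` is assumed to return an identifier of the lowest common
    ancestor of leaves u and v in one fixed tree (hypothetical interface).
    """
    rx, ry, xy = lca(r, x), lca(r, y), lca(x, y)
    if rx == ry == xy:
        return "(r,x,y)"      # unresolved (star) triplet
    if rx == ry:
        return "(r,(x,y))"    # x and y form the cherry
    if rx == xy:
        return "((r,y),x)"    # r and y form the cherry
    return "((r,x),y)"        # r and x form the cherry


# Hand-built LCA values for a tree of shape ((r, x), y).
table = {frozenset("rx"): "u", frozenset("ry"): "root", frozenset("xy"): "root"}
lookup = lambda u, v: table[frozenset((u, v))]
kind = triplet_type(lookup, "r", "x", "y")   # "((r,x),y)"
```

No node depths are consulted, which is consistent with the observation above that depths of input-tree nodes need not be stored.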

Support-list and Pruner-list. In the case of MXSTs, there is no support-list and the pruner-list contains leaf labels. This takes $O(|\mathcal{L}|)$ space per enumerated AST that is currently in memory. Note that MfstMiner uses a depth-first strategy for traversing the enumeration tree. Thus, not all enumerated ASTs are held in memory at any given time. In the case of MFSTs, both the support-list and the pruner-list exist. The support-list for an FST contains the list of input trees that display the FST. This takes $O(n/w)$ space per enumerated FST that is currently in memory, where w is the size of the machine word. The i-th bit of a support-list is set to 1 if the corresponding FST is displayed by the i-th tree in the input collection, and to 0 otherwise. Our experiments were run on processors with w = 64 bits, which is typical of modern computers. An entry in the pruner-list contains a leaf label and a support-list. This takes $O((n/w) + |\mathcal{L}|)$ space per enumerated FST that is currently in memory. These are worst-case estimates; in practice the memory consumption is much lower, because the pruner-list is required only for cases not captured by Theorems 2 and 4.

The next result shows that the memory required by MfstMiner scales polynomially with the number of input trees and the size of the common leaf set.

Theorem 6. MfstMiner requires $O(n|\mathcal{L}|^3)$ space in the case of enumerating MFSTs and $O(|\mathcal{L}|^3)$ space when only MXSTs are being enumerated.

Proof. The enumeration tree has depth at most $|\mathcal{L}|$. Enumerating an FST at this depth will require storing $O(|\mathcal{L}|)$ ancestor equivalence classes, each of which can have at most $|\mathcal{L}|$ FSTs. Thus, the maximum number of FSTs to be stored is $O(|\mathcal{L}|^2)$. Storing each such FST requires $O(|\mathcal{L}|)$ space for the subtree, $O(n/w)$ space for the support-list and $O((n/w) + |\mathcal{L}|)$ space for the pruner-list, where w is the size of the machine word. Thus, the maximum space required to store all FSTs is $O(((n/w) + |\mathcal{L}|)|\mathcal{L}|^2)$. Adding to this the space required to store LCA mappings, which is $O(n|\mathcal{L}|^3)$, we get the claimed figure. Similarly, it can be shown that in the case of MXSTs, MfstMiner requires $O(|\mathcal{L}|^3)$ space. □



Let $|\mathcal{F}|$ be the number of FSTs (not MFSTs) the input collection displays. Then, the worst-case time complexity of MfstMiner is $O(n|\mathcal{F}| + |\mathcal{L}||\mathcal{F}|)$, which is the same as EvoMiner's [10]. The time complexity is not polynomial in the number of MFSTs, because we cannot polynomially bound the number of FSTs that MfstMiner will enumerate to verify the pruning of an equivalence class. Indeed, it could happen that X prunes Y, but that this cannot be confirmed by Theorems 2 and 4. If so, MfstMiner must enumerate the branch at Y further, and there is no polynomial bound on the number of FSTs that would have to be generated to verify the pruning of Y by X.

Nonetheless, as we will see in the next section, even though MfstMiner shares the same worst-case time complexity with EvoMiner, it can be orders of magnitude faster than EvoMiner in practice.

Results and discussion 

To study the effectiveness of MfstMiner, we conducted four categories of experiments:

1. Comparison of MFSTs with MASTs.

2. Comparison of MfstMiner with EvoMiner [10], the state-of-the-art algorithm for enumerating all phylogenetic FSTs.

3. Evaluation of the scalability of MfstMiner with respect to the number of trees, the size of the leaf set and the support value.

4. Comparison with Ramu et al.'s [23] approach that mines MFSTs having maximum leaves.

Our dataset consists of bootstrapped trees from a previous study [40] on bootstrapping methods. There are seventeen sets of trees constructed from a diverse range of sequences including rbcL genes, mammalian sequences, bacterial and archaeal sequences, ITS sequences, fungal sequences, and grasses. The number of taxa in these single-gene and multi-gene DNA sequences varies from 125 to 2554. The entire dataset is available at http://lcbb.epfl.ch/BS.tar.bz2. We refer to the seventeen datasets as A-Q in increasing order of the number of taxa in the DNA sequences from which the trees were constructed. A corresponds to the set of trees with 125 taxa and Q corresponds to the set of trees with 2554 taxa. To extract datasets with different numbers of leaves and trees, we randomly selected the required number of trees and restricted them to a random set of leaves of the required size.
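The extraction step can be sketched as follows, assuming nested-tuple trees and assuming that the same random leaf set is applied to every selected tree (the text does not spell this out). The function names and the `seed` parameter are illustrative only.

```python
import random


def restrict(tree, keep):
    """Restrict a nested-tuple tree to the leaves in `keep`.

    Subtrees with no kept leaf are dropped and nodes left with a single
    child are suppressed. Returns None if no leaf survives.
    """
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [k for k in (restrict(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]          # suppress the unary node
    return tuple(kids)


def sample_dataset(trees, leaves, n_trees, n_leaves, seed=0):
    """Pick n_trees trees at random and restrict them to n_leaves random leaves."""
    rng = random.Random(seed)
    chosen = rng.sample(list(trees), n_trees)
    keep = set(rng.sample(sorted(leaves), n_leaves))
    return [restrict(t, keep) for t in chosen]
```

For example, restricting `(("a", "b"), ("c", ("d", "e")))` to {a, d, e} yields `("a", ("d", "e"))`.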
The experiments were split over 4 machines: 

• Two machines running Windows 7 64-bit with 
processor clock-speed of 3.4 GHz, 4 cores and 8 
threads. 

• One machine running Windows 7 64-bit with 
processor clock-speed of 3.16 GHz and 2 cores. 
• One machine running Linux 64-bit with processor clock-speed of 2.0 GHz, 6 cores and 12 threads.

Each experiment was averaged over 5 runs. For practical purposes, each run was allowed a maximum of 10000 seconds. Thus, any missing entry in the graphs indicates that the corresponding experiment exceeded this limit. Initially, we ran all experiments on one machine using only one core at a time (the ideal environment) to allow maximum fairness in comparing results. However, we soon realized that this would take more than 600 days of runtime. Thus, we split the experiments using the maximum number of cores and threads on each machine. This did slow things down because of competing memory and disk requirements; however, the final runtimes should not be more than a factor of two of the runtimes in the ideal environment.

MFSTs vs. MASTs. Figure 8(a) compares the size of the MAST with the size of the largest MFST. This experiment was conducted on a set of 100 trees on 50 leaves from each of the datasets. MFSTs were enumerated for f = 0.51. In some cases the largest MFST is more than twice as big as the corresponding MAST.

Figure 8(b) compares the number of MASTs with the number of MFSTs for f = 1. There are significantly more MFSTs. This is notable, because any MFST that is not a MAST is also not displayed by any of the MASTs. Thus, such an MFST reveals unique agreement information among the input trees. This experiment was conducted on a set of 100 trees on 100 leaves from each of the datasets.

Comparison with EvoMiner. This experiment was conducted on a set of 1000 trees on 40 leaves from each of the datasets. Figure 9(a) compares MfstMiner with EvoMiner [10] for f = .55 with respect to runtime. Figure 9(b) shows the corresponding number of MFSTs and FSTs mined by MfstMiner and EvoMiner respectively. Figures 9(c)-(d), 9(e)-(f), and 9(g)-(h) show the corresponding figures for support f = .75, f = .95 and f = 1.0 respectively. We see that enumerating MFSTs can very often be orders of magnitude faster than enumerating all FSTs. The time difference arises from the number of subtrees mined. The ratio of the number of subtrees mined by EvoMiner to the number of subtrees mined by MfstMiner is largest for support values f = .75 and f = .95; thus, MfstMiner's advantage over EvoMiner is greatest in these cases. For f = 1.0, EvoMiner is often faster than MfstMiner. We believe this is because (a) the runtimes are too small for a fair comparison, so the preprocessing time (enumerating all frequent triplets), which is the same for both EvoMiner and MfstMiner, dominates the total time, and (b) we suspect some implementation inefficiency in MfstMiner. The missing dataset entries correspond to cases where EvoMiner took more than 10000 seconds.

Scalability of MfstMiner. We evaluated the scalability of MfstMiner with respect to the number of leaves (10-250), the number of trees (100-10000) and the support value (.51-1.0) on datasets having at least 250 leaves, i.e., datasets D (354 taxa) through Q (2554 taxa). Presenting results for all datasets would have been overwhelming; thus, we discuss results for datasets D (354 taxa), K (1481 taxa) and Q (2554 taxa), i.e., the first, a middle one and the last from datasets D-Q.

Figure 10(a) shows the runtime for 200 trees, with the number of leaves varying from 10 to 250, for support values f = .55, f = .75, f = .95 and f = 1.0 on dataset D. Figure 10(b) shows the corresponding number of MFSTs mined. Figures 10(c)-(d) and 10(e)-(f) show the corresponding results for 1000 and 5000 trees respectively. The results show that for a given number of trees, the number of subtrees mined increases steadily with the number of leaves in the input trees, while the runtime closely follows the number of subtrees mined.

Figure 11(a) evaluates the variation in runtime for 50 leaves on 100-1000 trees for support values f = .55, f = .75, f = .95 and f = 1.0 on dataset D. Figure 11(b) shows the corresponding number of MFSTs mined. Figures 11(c) and 11(d) show the corresponding values while varying the number of trees from 2000 to 10000. Figures 11(e)-(h) and 11(i)-(l) show the corresponding results for input trees with 100 and 150 leaves respectively. The results show that for a given number of leaves, the number of subtrees remains much the same as the number of input trees is varied, while the runtime increases steadily with the number of input trees. This is expected because support estimation takes longer with more trees.

Figures 12, 13, 14 and 15 show the corresponding results for datasets K and Q respectively. The trends are similar to dataset D except that dataset Q seems to produce many more MFSTs; thus, the runtimes are larger. Results from dataset K seem to lie somewhere in the middle of the range of results from datasets D and Q.

The above results also show that MfstMiner can handle much larger datasets than EvoMiner. Again, the missing entries are due to the 10000 second time limit. However, if time is not a constraint, as discussed before, the memory requirements of MfstMiner are polynomial in the size of the input; thus, it can handle large datasets.

Figure 8 Utility of MFSTs over MASTs. (a) MFSTs have more leaves than MASTs; thus, they reveal common agreement over a larger set of taxa
than MASTs. (b) MXSTs are more numerous than MASTs; thus, they reveal more agreement information than MASTs.

Comparison with Ramu et al.'s approach. We compared our approach
with Ramu et al.'s [23] heuristic approach that mines MFSTs with
maximum leaves. We used the original implementation shared by the
authors.
Their current implementation mines only one MFST with maximum
leaves; thus, we cannot compare the number of subtrees mined by our
approach with theirs. Further, their implementation has major
portions written in Perl, a high-level interpreted programming
language, and involves significant disk usage for storing
intermediate data structures, whereas our implementation is written
in C++, a much lower-level compiled programming language, and keeps
all intermediate data structures in memory. Thus, we did not
compare runtimes, because our implementation has a substantial
advantage in execution speed. Instead, we compared the size of the
MFST with maximum leaves mined by Ramu et al.'s [23] implementation
with ours. We did this comparison on a set of 100 trees with 20
leaves from each of the datasets for support values f = .55 and
f = .75. In 25 out of 34 cases, the size of the MFST with maximum
leaves returned by Ramu et al.'s [23] approach was at least as good
as ours. In the remaining 9 cases, it returned an MFST with one
leaf less than ours. So, if the goal is to get an MFST with the
most leaves, Ramu et al.'s [23] approach seems near-perfect. As
mentioned in their paper [23], it also seems to be capable of
handling large datasets in terms of the number of trees and the
number of leaves. However, if the goal is to mine all MFSTs with
maximum leaves, or simply all MFSTs, then our approach serves
better. This can be very useful because there can be many MFSTs
(either with maximum leaves or all of them), and every MFST
returned by our approach conveys some unique agreement information
not conveyed by any of the remaining returned MFSTs.



Conclusions 

Although we have restricted our attention to enumerating MFSTs for
$f \in (\frac{1}{2}, 1]$, we can extend MfstMiner to enumerate all
MFSTs for $f \in (0, \frac{1}{2}]$, with small modifications in the
pruning strategy. Note, however, that when $f \in (0, \frac{1}{2}]$
there can potentially be different MFSTs with the same leaf set.
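The remark about leaf sets follows from a counting argument: two different topologies on the same leaf set cannot both be supported by more than half of the input trees, but each can be supported by exactly half. The Python sketch below illustrates this on a toy collection. It assumes trees are nested tuples of leaf labels, that every input tree contains the queried leaf set, and that "frequent" means support of at least f; the function names are hypothetical and this is not part of MfstMiner.

```python
from collections import Counter


def restrict(tree, keep):
    """Restrict a nested-tuple tree to the leaves in `keep`,
    suppressing nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def canonical(tree):
    """Order-independent string form of an unordered rooted tree."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canonical(c) for c in tree)) + ")"


def frequent_topologies(trees, leaf_set, f):
    """All topologies on `leaf_set` whose support is at least f.
    Assumes every input tree contains all leaves in `leaf_set`."""
    counts = Counter()
    for t in trees:
        r = restrict(t, leaf_set)
        if r is not None:
            counts[canonical(r)] += 1
    return sorted(topo for topo, c in counts.items() if c / len(trees) >= f)


# Two topologies on {a, b, c}, each displayed by half of the trees.
trees = [
    (("a", "b"), "c"),
    (("a", "b"), "c"),
    (("a", "c"), "b"),
    (("a", "c"), "b"),
]
both = frequent_topologies(trees, {"a", "b", "c"}, 0.5)    # two topologies
none = frequent_topologies(trees, {"a", "b", "c"}, 0.55)   # no topology
```

With f = 0.5 both topologies qualify, whereas any threshold above one half admits at most one topology per leaf set.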

As future work, we intend to do a thorough comparison of MFSTs
against MASTs in the settings where MAST is currently used [2,5,7].
Since the time to enumerate MFSTs for larger leaf sets can be
prohibitive, we also intend to develop schemes to sample at random
from the set of all MFSTs.

An intriguing open problem is to devise methods to 
find common patterns in collections of phylogenetic net- 
works [41-44]. Although techniques from maximal sub- 
graph mining [16,17] may prove useful here, the special 
characteristics of phylogenetic networks add interesting 
twists to the problem. We also intend to extend our work 
for mining frequent sub-structures in multi-labeled trees 
[45-48]. 

The current implementation of MfstMiner, which works for up to 250
leaves and 10000 trees, is available at
https://code.google.com/p/mfst-miner/.

Figure 9 Comparison with EvoMiner. (a) Runtime comparison for f = .55. (b) Number of subtrees enumerated for f = .55. (c) Runtime
comparison for f = .75. (d) Number of subtrees enumerated for f = .75. (e) Runtime comparison for f = .95. (f) Number of subtrees enumerated
for f = .95. (g) Runtime comparison for f = 1.0. (h) Number of subtrees enumerated for f = 1.0.

Figure 10 Scalability of MfstMiner on dataset D (354 taxa) while varying the number of leaves in the input trees. (a) Runtime comparison
on 200 input trees. (b) Number of subtrees enumerated on 200 input trees. (c) Runtime comparison on 1000 input trees. (d) Number of subtrees
enumerated on 1000 input trees. (e) Runtime comparison on 5000 input trees. (f) Number of subtrees enumerated on 5000 input trees.

Appendix: Proofs

Proof of Lemma 1.

1. Suppose $T_{xy}$ is a result of a type-1 join. Thus,
$\psi(x) < \psi(y)$ and agreement triplet $[r, x, y]$ is of type
$(r, x, y)$. We have four possibilities to consider.



(a) $T_{yz}$ is a result of a type-1 join (see Figure 16(a)).
Thus, $\psi(y) < \psi(z)$ and agreement triplet $[r, y, z]$ is of
type $(r, y, z)$. Potential AST $T$ must be obtained by grafting
$z$ in $T_{xy}$. Since $T$ must display $(r, y, z)$, there is only
one possibility of grafting $z$ in $T_{xy}$: $z$ should be grafted
on the common parent of $x$ and $y$. Thus, AST $T$ exists. Further,
since $\psi(x) < \psi(y) < \psi(z)$, there is only one possible
canonical topology for potential AST $T$: see Figure 16(a). Since
$T$ has $x$, $y$, $z$ as the third-to-last, second-to-last and last
leaf respectively in the IDFT, pruning $y$ will result in a tree
that is (a) canonical, (b) has $z$ as the last leaf in the IDFT,
and (c) has $x$ as the second-to-last leaf in the IDFT. By
definition, the resulting tree is $T_{xz}$ (see Figure 16(a)).
Thus, $T_{xz}$ exists. Further, the topology of $T_{xz}$ implies
that $T_{xz}$ is a result of a type-1 join. Similarly, pruning the
last leaf, i.e., $z$, in $T$ will result in $T_{xy}$. Thus,
$\mathcal{E}_T$ is a child of $\mathcal{E}_{T_{xy}}$, thus, a
descendant of X.

Deepak and Fernandez-Baca Algorithms for Molecular Biology 2014, 9:16
http://www.almob.org/content/9/1/16

Figure 11 Scalability of MfstMiner on dataset D (354 taxa) while varying the number of input trees. (a) Runtime comparison with 50 leaves
in the input trees while varying the number of input trees from 100 to 1000. (b) Number of subtrees enumerated with 50 leaves in the input trees
while varying the number of input trees from 100 to 1000. (c) Runtime comparison with 50 leaves in the input trees while varying the number of
input trees from 2000 to 10000. (d) Number of subtrees enumerated with 50 leaves in the input trees while varying the number of input trees from
2000 to 10000. (e) Runtime comparison with 100 leaves in the input trees while varying the number of input trees from 100 to 1000. (f) Number of
subtrees enumerated with 100 leaves in the input trees while varying the number of input trees from 100 to 1000. (g) Runtime comparison with 100
leaves in the input trees while varying the number of input trees from 2000 to 10000. (h) Number of subtrees enumerated with 100 leaves in the
input trees while varying the number of input trees from 2000 to 10000. (i) Runtime comparison with 150 leaves in the input trees while varying the
number of input trees from 100 to 1000. (j) Number of subtrees enumerated with 150 leaves in the input trees while varying the number of input
trees from 100 to 1000. (k) Runtime comparison with 150 leaves in the input trees while varying the number of input trees from 2000 to 10000. (l)
Number of subtrees enumerated with 150 leaves in the input trees while varying the number of input trees from 2000 to 10000.
(b) $T_{yz}$ is a result of a type-2 join (see Figure 16(b)).
Thus, $\psi(y) < \psi(z)$ and agreement triplet $[r, y, z]$ is of
type $(r, (y, z))$. Potential AST $T$ must be obtained by grafting
$z$ in $T_{xy}$. Since $T$ must display $(r, (y, z))$, there is
only one possibility of grafting $z$ in $T_{xy}$: $z$ should be
grafted on the edge $(p_y, y)$. Thus, AST $T$ exists. Further,
since $\psi(x) < \psi(y) < \psi(z)$, there is only one possible
canonical topology for potential AST $T$; see Figure 16(b). Since
$T$ has $x$, $y$, $z$ as the third-to-last, second-to-last and last
leaf respectively in the IDFT, pruning $y$ will result in a tree
that is (a) canonical, (b) has $z$ as the last leaf in the IDFT,
and (c) has $x$ as the second-to-last leaf in the IDFT. By
definition, the resulting tree is $T_{xz}$ (see Figure 16(b)).
Thus, $T_{xz}$ exists. Further, the topology of $T_{xz}$ implies
that $T_{xz}$ is a result of a type-1 join. Similarly, pruning the
last leaf, i.e., $z$, in $T$ will result in $T_{xy}$. Thus,
$\mathcal{E}_T$ is a child of $\mathcal{E}_{T_{xy}}$, thus, a
descendant of X.

Figure 12 Scalability of MfstMiner on dataset K (1481 taxa) while varying the number of leaves in the input trees. (a) Runtime comparison
on 200 input trees. (b) Number of subtrees enumerated on 200 input trees. (c) Runtime comparison on 1000 input trees. (d) Number of subtrees
enumerated on 1000 input trees. (e) Runtime comparison on 5000 input trees. (f) Number of subtrees enumerated on 5000 input trees.
(c) $T_{yz}$ is a result of a type-3 join (see Figure 16(c)).
Potential AST $T$ must be obtained by grafting $x$ in $T_{yz}$.
Since $T$ must display $(r, x, y)$, there is only one possibility
of grafting $x$ in $T_{yz}$: $x$ should be grafted on the parent of
$y$. Thus, AST $T$ exists. Further, since $\psi(x) < \psi(y)$,
there is only one possible canonical topology for potential AST
$T$; see Figure 16(c). Since $T$ has $x$, $y$, $z$ as the
third-to-last, second-to-last and last leaf respectively in the
IDFT, pruning $y$ will result in a tree that is (a) canonical, (b)
has $z$ as the last leaf in the IDFT, and (c) has $x$ as the
second-to-last leaf in the IDFT. By definition, the resulting tree
is $T_{xz}$ (see Figure 16(c)). Thus, $T_{xz}$ exists. Further, the
topology of $T_{xz}$ implies that $T_{xz}$ is a result of a type-3
join. Similarly, pruning the last leaf, i.e., $z$, in $T$ will
result in $T_{xy}$. Thus, $\mathcal{E}_T$ is a child of
$\mathcal{E}_{T_{xy}}$, thus, a descendant of X.

Figure 13 Scalability of MfstMiner on dataset K (1481 taxa) while varying the number of input trees. (a) Runtime comparison with 50
leaves in the input trees while varying the number of input trees from 100 to 1000. (b) Number of subtrees enumerated with 50 leaves in the input
trees while varying the number of input trees from 100 to 1000. (c) Runtime comparison with 50 leaves in the input trees while varying the number
of input trees from 2000 to 10000. (d) Number of subtrees enumerated with 50 leaves in the input trees while varying the number of input trees
from 2000 to 10000. (e) Runtime comparison with 100 leaves in the input trees while varying the number of input trees from 100 to 1000. (f)
Number of subtrees enumerated with 100 leaves in the input trees while varying the number of input trees from 100 to 1000. (g) Runtime
comparison with 100 leaves in the input trees while varying the number of input trees from 2000 to 10000. (h) Number of subtrees enumerated
with 100 leaves in the input trees while varying the number of input trees from 2000 to 10000. (i) Runtime comparison with 150 leaves in the input
trees while varying the number of input trees from 100 to 1000. (j) Number of subtrees enumerated with 150 leaves in the input trees while varying
the number of input trees from 100 to 1000. (k) Runtime comparison with 150 leaves in the input trees while varying the number of input trees from
2000 to 10000. (l) Number of subtrees enumerated with 150 leaves in the input trees while varying the number of input trees from 2000 to 10000.
(d) $T_{yz}$ is a result of a type-4 join (see Figure 16(d)).
Potential AST $T$ must be obtained by grafting $x$ in $T_{yz}$.
Since $T$ must display $(r, x, y)$, there is only one possibility
of grafting $x$ in $T_{yz}$: $x$ should be grafted on the parent of
$y$. Thus, AST $T$ exists. Further, since $\psi(x) < \psi(y)$,
there is only one possible canonical topology for potential AST
$T$; see Figure 16(d). Since $T$ has $x$, $y$, $z$ as the
third-to-last, second-to-last and last leaf respectively in the
IDFT, pruning $y$ will result in a tree that is (a) canonical, (b)
has $z$ as the last leaf in the IDFT, and (c) has $x$ as the
second-to-last leaf in the IDFT. By definition, the resulting tree
is $T_{xz}$ (see Figure 16(d)). Thus, $T_{xz}$ exists. Further, the
topology of $T_{xz}$ implies that $T_{xz}$ is a result of a type-4
join. Similarly, pruning the last leaf (i.e., $z$) in $T$ will
result in $T_{xy}$. Thus, $\mathcal{E}_T$ is a child of
$\mathcal{E}_{T_{xy}}$, thus, a descendant of X.

Figure 14 Scalability of MfstMiner on dataset Q (2554 taxa) while varying the number of leaves in the input trees. (a) Runtime comparison
on 200 input trees. (b) Number of subtrees enumerated on 200 input trees. (c) Runtime comparison on 1000 input trees. (d) Number of subtrees
enumerated on 1000 input trees. (e) Runtime comparison on 5000 input trees. (f) Number of subtrees enumerated on 5000 input trees.

Similarly, one can show that if $T_{xy}$ is a result of a join of
type 3 or 4, irrespective of the type of join $T_{yz}$ is a result
of, ASTs $T$ and $T_{xz}$ exist, $T_{xz}$ is not a result of a
type-2 join, and $\mathcal{E}_T$ is a descendant of X.

2. Suppose each of $T_{xy}$ and $T_{yz}$ is a result of a type-2
join (see Figure 17(a)). Thus, agreement triplet $[r, y, z]$ is of
type $(r, (y, z))$ and $\psi(y) < \psi(z)$. Potential AST $T$ must
be obtained by grafting $z$ in $T_{xy}$. Since $T$ must display
$(r, (y, z))$ and $\psi(x) < \psi(y) < \psi(z)$, there are four
possible canonical topologies for potential AST $T$: see
Figures 17(b)-17(e). However, since $[x, y, z]$ is an agreement
triplet, only one of the four topologies exists across all input
trees. Thus, AST $T$ exists. Further, in all the cases discussed
above, the last two leaves in the IDFT of $T$ are $y$ and $z$
(either $y$ comes before $z$ or vice versa), while $x$ is the
third-to-last leaf. Thus, pruning $y$ in $T$ will result in a
canonical tree where $x$ and $z$ are the second-to-last and last
leaves in the IDFT. By definition, the resulting tree is $T_{xz}$.
Thus, $T_{xz}$ exists. Further, pruning the last leaf in $T$ will
result in either $T_{xy}$ (if $z$ is the last leaf in $T$) or
$T_{xz}$ (if $y$ is the last leaf in $T$). In either case,
$\mathcal{E}_T$ is a descendant of X, as claimed.

Figure 15 Scalability of MfstMiner on dataset Q (2554 taxa) while varying the number of input trees. (a) Runtime comparison with 50
leaves in the input trees while varying the number of input trees from 100 to 1000. (b) Number of subtrees enumerated with 50 leaves in the input
trees while varying the number of input trees from 100 to 1000. (c) Runtime comparison with 50 leaves in the input trees while varying the number
of input trees from 2000 to 10000. (d) Number of subtrees enumerated with 50 leaves in the input trees while varying the number of input trees
from 2000 to 10000. (e) Runtime comparison with 100 leaves in the input trees while varying the number of input trees from 100 to 1000. (f)
Number of subtrees enumerated with 100 leaves in the input trees while varying the number of input trees from 100 to 1000. (g) Runtime
comparison with 100 leaves in the input trees while varying the number of input trees from 2000 to 10000. (h) Number of subtrees enumerated
with 100 leaves in the input trees while varying the number of input trees from 2000 to 10000. (i) Runtime comparison with 150 leaves in the input
trees while varying the number of input trees from 100 to 1000. (j) Number of subtrees enumerated with 150 leaves in the input trees while varying
the number of input trees from 100 to 1000. (k) Runtime comparison with 150 leaves in the input trees while varying the number of input trees from
2000 to 10000. (l) Number of subtrees enumerated with 150 leaves in the input trees while varying the number of input trees from 2000 to 10000.
3. Suppose $T_{xy}$ is a result of a type-2 join. Thus,
$\psi(x) < \psi(y)$. Consider the following possibilities:

(a) $T_{yz}$ is a result of a type-1 join (see Figure 17(f)).
Thus, $\psi(y) < \psi(z)$. Potential AST $T$ must be obtained by
grafting $x$ in $T_{yz}$. Since $T$ must display $(r, (x, y))$,
there is only one possibility of grafting $x$ in $T_{yz}$: $x$
should be grafted on the edge $(p_y, y)$. Thus, AST $T$ exists.
Further, since $\psi(x) < \psi(y) < \psi(z)$, there is only one
possible canonical topology for potential AST $T$: see
Figure 17(f). Since $T$ has $x$, $y$, $z$ as the third-to-last,
second-to-last and last leaf respectively in the IDFT, pruning $y$
will result in a tree that is (a) canonical, (b) has $z$ as the
last leaf in the IDFT, and (c) has $x$ as the second-to-last leaf
in the IDFT. By definition, the resulting tree is $T_{xz}$ (see
Figure 17(f)). Further, the topology of $T_{xz}$ implies that
$T_{xz}$ results from a type-1 join.

Figure 16 Supportive illustrations for the proof of Lemma 1, part 1. $T_{xy}$ is a result of type-1 join in all the cases. (a) $T_{yz}$ is a result of type-1 join.
(b) $T_{yz}$ is a result of type-2 join. (c) $T_{yz}$ is a result of type-3 join. (d) $T_{yz}$ is a result of type-4 join.



(b) $T_{yz}$ is a result of a type-3 join (see Figure 17(g)).
Potential AST $T$ must be obtained by grafting $x$ in $T_{yz}$.
Since $T$ must display $(r, (x, y))$, there is only one possibility
of grafting $x$ in $T_{yz}$: $x$ should be grafted on the edge
$(p_y, y)$. Thus, AST $T$ exists. Further, since
$\psi(x) < \psi(y)$, there is only one possible canonical topology
for potential AST $T$: see Figure 17(g). Since $T$ has $x$, $y$,
$z$ as the third-to-last, second-to-last and last leaf respectively
in the IDFT, pruning $y$ will result in a tree that is (a)
canonical, (b) has $z$ as the last leaf in the IDFT, and (c) has
$x$ as the second-to-last leaf in the IDFT. By definition, the
resulting tree is $T_{xz}$ (see Figure 17(g)). Further, the
topology of $T_{xz}$ implies that $T_{xz}$ results from a type-3
join.
(c) $T_{yz}$ is a result of a type-4 join (see Figure 17(h)).
Potential AST $T$ must be obtained by grafting $x$ in $T_{yz}$.
Since $T$ must display $(r, (x, y))$, there is only one possibility
of grafting $x$ in $T_{yz}$: $x$ should be grafted on the edge
$(p_y, y)$. Thus, AST $T$ exists. Further, since
$\psi(x) < \psi(y)$, there is only one possible canonical topology
for potential AST $T$: see Figure 17(h). Since $T$ has $x$, $y$,
$z$ as the third-to-last, second-to-last and last leaf respectively
in the IDFT, pruning $y$ will result in a tree that is (a)
canonical, (b) has $z$ as the last leaf in the IDFT, and (c) has
$x$ as the second-to-last leaf in the IDFT. By definition, the
resulting tree is $T_{xz}$ (see Figure 17(h)). Further, the
topology of $T_{xz}$ implies that $T_{xz}$ results from a type-4
join.

Further, in all the above cases, $y$ and $z$ are the second-to-last
and the last leaves in $T$. Thus, pruning $z$, the rightmost leaf,
in $T$ will result in a tree that is (a) canonical, (b) has $y$ as
the last leaf in the IDFT, and (c) has $x$ as the second-to-last
leaf in the IDFT. By definition, the resulting tree is $T_{xy}$.
Hence, $\mathcal{E}_T$ is a descendant of $\mathcal{E}_{T_{xy}}$,
as claimed.

□ 
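The proof above repeatedly uses three tree operations: grafting a leaf on the parent of an existing leaf, grafting a leaf on the edge above an existing leaf, and pruning a leaf. The Python sketch below illustrates them on nested tuples. It is an illustration only: it ignores the canonical ordering and the IDFT used in the paper, it assumes that every leaf mentioned is a direct string child of its parent tuple, and the function names are hypothetical.

```python
def graft_on_parent(tree, sibling, new_leaf):
    """Attach `new_leaf` as an extra child of the parent of leaf `sibling`."""
    if isinstance(tree, str):
        return tree
    if sibling in tree:  # `sibling` is a direct (leaf) child of this node
        return tuple(tree) + (new_leaf,)
    return tuple(graft_on_parent(c, sibling, new_leaf) for c in tree)


def graft_on_edge(tree, leaf, new_leaf):
    """Subdivide the edge above `leaf` and hang `new_leaf` from the new node,
    so that `leaf` and `new_leaf` become siblings under a fresh parent."""
    if isinstance(tree, str):
        return (tree, new_leaf) if tree == leaf else tree
    return tuple(graft_on_edge(c, leaf, new_leaf) for c in tree)


def prune(tree, leaf):
    """Remove `leaf` and suppress any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == leaf else tree
    kids = [k for k in (prune(c, leaf) for c in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


# A toy tree in which r, x and y hang from a common parent.
t_xy = ("r", "x", "y")

# Grafting z on the common parent of x and y, as in part 1(a).
t_fan = graft_on_parent(t_xy, "y", "z")

# Grafting z on the edge above y, as in part 1(b).
t_cherry = graft_on_edge(t_xy, "y", "z")
```

In both toy cases, pruning `z` gives back `t_xy`, and pruning `y` gives a tree on `r`, `x` and `z`, which mirrors the role that $T_{xy}$ and $T_{xz}$ play in the proof.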

Proof of Lemma 2. Without loss of generality, let
$\psi(y) < \psi(z)$. Let $p_x$, $p_y$ and $p_z$ denote the parent
of $x$, $y$ and $z$ respectively. Consider the following cases:

1. $\mathrm{depth}_{T_{xy}}(p_y) > \mathrm{depth}_{T_{xz}}(p_z)$:
As per Theorem 1, $T_{x\text{-}yz}$ exists as a result of a type-4
join.

2. $\mathrm{depth}_{T_{xz}}(p_z) > \mathrm{depth}_{T_{xy}}(p_y)$:
As per Theorem 1, $T_{x\text{-}zy}$ exists as a result of a type-4
join.

3. $\mathrm{depth}_{T_{xy}}(p_y) = \mathrm{depth}_{T_{xz}}(p_z)$:
Consider the following sub-cases:

(a) Agreement triplet $[x, y, z]$ is of type $(x, y, z)$. Thus,
$\mathrm{depth}(\mathrm{LCA}(x, y)) = \mathrm{depth}(\mathrm{LCA}(y, z)) = \mathrm{depth}(\mathrm{LCA}(x, z))$
across all input trees. Thus, as per Theorem 1, $T_{x\text{-}yz}$
exists as a result of a type-1 join.

(b) Agreement triplet $[x, y, z]$ is of type $(x, (y, z))$. Thus,
$\mathrm{depth}(\mathrm{LCA}(x, y)) = \mathrm{depth}(\mathrm{LCA}(x, z))$
and
$\mathrm{depth}(\mathrm{LCA}(x, y)) < \mathrm{depth}(\mathrm{LCA}(y, z))$
across all input trees. Thus, as per Theorem 1, $T_{x\text{-}yz}$
exists as a result of a type-2 join.

(c) Agreement triplet $[x, y, z]$ is of type $((x, y), z)$. Thus,
$\mathrm{depth}(\mathrm{LCA}(x, y)) > \mathrm{depth}(\mathrm{LCA}(x, z))$
across all input trees. Thus, as per Theorem 1, $T_{x\text{-}yz}$
exists as a result of a type-3 join.

(d) Agreement triplet $[x, y, z]$ is of type $((x, z), y)$. Thus,
$\mathrm{depth}(\mathrm{LCA}(x, z)) > \mathrm{depth}(\mathrm{LCA}(x, y))$
across all input trees. Thus, as per Theorem 1, $T_{x\text{-}zy}$
exists as a result of a type-3 join.

Thus, in each of the above cases, either $T_{x\text{-}yz}$ or
$T_{x\text{-}zy}$ exists, as claimed. □
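The four triplet types used in sub-cases (a)-(d) can be told apart purely by comparing the depths of the pairwise lowest common ancestors. The Python sketch below shows one way to do this, assuming trees are nested tuples of leaf labels and that each tree contains all three leaves; the function names are hypothetical, and the check that a triplet has the same type in every input tree is our reading of "agreement triplet".

```python
def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path of child indices."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def lca_depth(paths, u, v):
    """Depth of LCA(u, v): length of the common prefix of the two paths."""
    d = 0
    for a, b in zip(paths[u], paths[v]):
        if a != b:
            break
        d += 1
    return d


def triplet_type(tree, x, y, z):
    """Classify the rooted triplet on x, y, z by pairwise LCA depths."""
    p = leaf_paths(tree)
    dxy, dyz, dxz = lca_depth(p, x, y), lca_depth(p, y, z), lca_depth(p, x, z)
    if dxy == dyz == dxz:
        return "(x,y,z)"      # all three LCAs coincide
    if dyz > dxy:
        return "(x,(y,z))"    # y and z are closest
    if dxy > dxz:
        return "((x,y),z)"    # x and y are closest
    return "((x,z),y)"        # x and z are closest


def is_agreement_triplet(trees, x, y, z):
    """True if the triplet has the same type in every input tree."""
    return len({triplet_type(t, x, y, z) for t in trees}) == 1
```

In a rooted tree, at least two of the three pairwise LCA depths are equal and minimal, so the four branches above are exhaustive.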

Proof of Lemma 3. (Only If) An AST is enumerated by joining two
trees in an equivalence class. Thus, the union of leaf sets of
trees in an equivalence class is a subset of the union of leaf sets
of trees in the parent equivalence class. Extending this reasoning,
the union of leaf sets of trees in an equivalence class is a subset
of the union of leaf sets of trees in any ancestor equivalence
class. Thus, if $\mathcal{E}_B$ is a descendant of $\mathcal{E}_A$,
$\mathcal{L}_B$ is a subset of the union of leaf sets of trees in
$\mathcal{E}_A$. Further, every tree in $\mathcal{E}_A$ has $A$ as
its prefix. Thus, for every
$\ell \in \{\mathcal{L}_B \setminus \mathcal{L}_A\}$, $T_{a\ell}$
must exist.

(If) For every $\{i, j\} \subseteq \{\mathcal{L}_B \setminus \mathcal{L}_A\}$,
$T_{ai}$ and $T_{aj}$ exist (if condition of the claim), and
$[a, i, j]$ is an agreement triplet (because
$\{a, i, j\} \subseteq \mathcal{L}_B$). Thus, as per Lemma 2,
either $T_{a\text{-}ij}$ or $T_{a\text{-}ji}$ exists. We show that
there exists an $\ell \in \{\mathcal{L}_B \setminus \mathcal{L}_A\}$
such that for every
$i \in \{\mathcal{L}_B \setminus \{\mathcal{L}_A \cup \ell\}\}$,
$T_{a\text{-}\ell i}$ exists. If this is not the case, there exists
a sequence of leaves
$(l_0, l_1, \ldots, l_n) \in \{\mathcal{L}_B \setminus \mathcal{L}_A\}$
such that $T_{a\text{-}l_0 l_1}$, $T_{a\text{-}l_1 l_2}$, $\ldots$,
$T_{a\text{-}l_{n-1} l_n}$, $T_{a\text{-}l_n l_0}$ exist. Since
$T_{a\text{-}l_0 l_1}$ and $T_{a\text{-}l_1 l_2}$ exist, and
$[l_0, l_1, l_2]$ is an agreement triplet (because
$\{l_0, l_1, l_2\} \subseteq \mathcal{L}_B$), as per Lemma 1, AST
$T_{a\text{-}l_0 l_2}$ exists. Extending this reasoning, it can be
shown that $T_{a\text{-}l_0 l_n}$ exists, a contradiction because
$T_{a\text{-}l_n l_0}$ already exists. Thus, there exists an
$\ell \in \{\mathcal{L}_B \setminus \mathcal{L}_A\}$ such that for
every $i \in \{\mathcal{L}_B \setminus \{\mathcal{L}_A \cup \ell\}\}$,
$T_{a\text{-}\ell i}$ exists. By definition, each such
$T_{a\text{-}\ell i}$ belongs to $\mathcal{E}_{T_{a\ell}}$, and
$\mathcal{E}_{T_{a\ell}}$ is a child of A. Extending the same
reasoning, it can be shown that there exists a sequence of ASTs
$(T_1, T_2, \ldots)$, where $T_1 = A$, such that
$\mathcal{E}_{T_{i+1}}$ is a child of $\mathcal{E}_{T_i}$ and
$\mathcal{L}_{T_{i+1}} \setminus \mathcal{L}_{T_i}$ is a leaf in
$\{\mathcal{L}_B \setminus \mathcal{L}_A\}$. By definition, the
last AST in the sequence is $B$. Thus, $\mathcal{E}_B$ is a
descendant of $\mathcal{E}_A$. □

Figure 17 Supportive illustrations for the proof of Lemma 1, parts 2 and 3. $T_{xy}$ is a result of type-2 join in all the cases. (a) $T_{yz}$ is a result of
type-2 join. There are four possibilities for a topology that can display both $T_{xy}$ and $T_{yz}$: (b)-(e). (f) $T_{yz}$ is a result of type-1 join. (g) $T_{yz}$ is a result of
type-3 join. (h) $T_{yz}$ is a result of type-4 join.

Proof of Lemma 4. Since $T_{xy}$ is a result of a type-2 join,
agreement triplet $[r, x, y]$ is of type $(r, (x, y))$ and
$\psi(x) < \psi(y)$. Consider any
$\ell \in \{\mathcal{L}_{D_c} \setminus \mathcal{L}_{Y_c}\}$ such
that $[x, y, \ell]$ is an agreement triplet. Since D is a
descendant of Y, by Lemma 3, $T_{y\ell}$ exists. Since
$[x, y, \ell]$ is an agreement triplet, by Lemma 1, there exists an
AST $T$ on the leaf set $\mathcal{L}_{E_c} \cup \{x, y, \ell\}$
that displays both $T_{xy}$ and $T_{y\ell}$, and $\mathcal{E}_T$ is
a descendant of X. Consider the possible topologies for $T$. (Note
that we have already iterated through the possible topologies for
such a $T$ during the proof of Lemma 1, parts 2 and 3, but we
enumerate them again for ease of reference.) Since $T_{xy}$ is a
result of a type-2 join, the topology of $T$ depends on the type of
join $T_{y\ell}$ is a result of. Consider the following cases.

1. $T_{y\ell}$ is a result of a type-1 join. There exists only one
possible canonical topology for $T$: the one corresponding to
Figure 17(f) (replace $z$ with $\ell$ in tree $T$).

2. $T_{y\ell}$ is a result of a type-2 join. There exist four
possible canonical topologies for $T$: the ones corresponding to
Figures 17(b)-17(e) (replace $z$ with $\ell$ in tree $T$).

3. $T_{y\ell}$ is a result of a type-3 join. There exists only one
possible canonical topology for $T$: the one corresponding to
Figure 17(g) (replace $z$ with $\ell$ in tree $T$).

4. $T_{y\ell}$ is a result of a type-4 join. There exists only one
possible canonical topology for $T$: the one corresponding to
Figure 17(h) (replace $z$ with $\ell$ in tree $T$).

Figure 18 Supportive illustrations for the proof of Lemma 4. $T_a$ corresponds to the topology in Figure 17(e) (replace z with a in tree T) in all the
cases. (a) $T_b$ corresponds to the topology in Figure 17(f) (replace z with b in tree T). (b) $T_b$ corresponds to the topology in Figure 17(b) (replace z
with b in tree T). (c) $T_b$ corresponds to the topology in Figure 17(c) (replace z with b in tree T). (d) $T_b$ corresponds to the topology in Figure 17(d)
(replace z with b in tree T).

Note that out of the possible topologies for $T$, only one has
$\ell$ as the second-to-last leaf in the IDFT: the one
corresponding to Figure 17(e) (replace $z$ with $\ell$ in tree
$T$). Thus, this topology belongs to $\mathcal{E}_{T_{x\ell}}$; the
rest have $y$ as the second-to-last leaf in the IDFT, and thus
belong to $\mathcal{E}_{T_{xy}}$. Consider any
$\{a, b\} \subseteq \{\mathcal{L}_{D_c} \setminus \mathcal{L}_{Y_c}\}$
such that $[x, y, a]$ and $[x, y, b]$ are agreement triplets. Thus,
by our earlier discussion in this proof, there exist ASTs $T_a$ on
leaf set $\mathcal{L}_{E_c} \cup \{x, y, a\}$ and $T_b$ on leaf set
$\mathcal{L}_{E_c} \cup \{x, y, b\}$, and both $\mathcal{E}_{T_a}$
and $\mathcal{E}_{T_b}$ are descendants of X. Consider the
following cases:

1. Both $T_a$ and $T_b$ belong to $\mathcal{E}_{T_{xy}}$. Since
$[y, a, b]$ is an agreement triplet, by Lemma 2, there exists an
AST on leaf set $\mathcal{L}_{T_{xy}} \cup \{a, b\}$. Thus,
$[x, a, b]$ is an agreement triplet.

2. $T_a$ and $T_b$ do not both belong to $\mathcal{E}_{T_{xy}}$.
Without loss of generality, let $T_a$ belong to
$\mathcal{E}_{T_{xa}}$, i.e., it corresponds to the topology in
Figure 17(e) (replace $z$ with $a$ in tree $T$). Thus, agreement
triplet $[x, y, a]$ is of type $((x, a), y)$ and
$\psi(x) < \psi(a)$. Consider potential AST $T'$ on leaf set
$\mathcal{L}_{T_{xy}} \cup \{a, b\}$ that displays both $T_a$ and
$T_b$. Since the topology of $T_a$ is known, the topology of $T'$
depends on the topology of $T_b$. Consider the following cases. In
the subsequent discussion, let $p_x$, $p_y$, $p_a$ and $p_b$ denote
the parent of $x$, $y$, $a$ and $b$ respectively.

(a) $T_b$ corresponds to the topology in Figure 17(f) (replace $z$
with $b$ in tree $T$). Potential AST $T'$ must be obtained by
grafting $a$ in $T_b$. Since $T'$ must display $((x, a), y)$, there
is only one possibility of grafting $a$ in $T_b$: $a$ should be
grafted on the edge $(p_x, x)$. Thus, $T'$ exists. Considering
$\psi(x) < \psi(a)$, the canonical topology for AST $T'$ is shown
in Figure 18(a).

Figure 19 More supportive illustrations for the proof of Lemma 4. $T_a$ corresponds to the topology in Figure 17(e) (replace z with a in tree T) in
all the cases. (a) $T_b$ corresponds to the topology in Figure 17(e) (replace z with b in tree T). (b) $T_b$ corresponds to the topology in Figure 17(g)
(replace z with b in tree T). (c) $T_b$ corresponds to the topology in Figure 17(h) (replace z with b in tree T).

(b) $T_b$ corresponds to the topology in Figure 17(b) (replace $z$
with $b$ in tree $T$). Potential AST $T'$ must be obtained by
grafting $a$ in $T_b$. Since $T'$ must display $((x, a), y)$, there
is only one possibility of grafting $a$ in $T_b$: $a$ should be
grafted on the edge $(p_x, x)$. Thus, $T'$ exists. Considering
$\psi(x) < \psi(a)$, the canonical topology for AST $T'$ is shown
in Figure 18(b).

(c) $T_b$ corresponds to the topology in Figure 17(c) (replace $z$
with $b$ in tree $T$). Potential AST $T'$ must be obtained by
grafting $a$ in $T_b$. Since $T'$ must display $((x, a), y)$, there
is only one possibility of grafting $a$ in $T_b$: $a$ should be
grafted on the edge $(p_x, x)$. Thus, $T'$ exists. Considering
$\psi(x) < \psi(a)$, the canonical topology for AST $T'$ is shown
in Figure 18(c).

(d) $T_b$ corresponds to the topology in Figure 17(d) (replace $z$
with $b$ in tree $T$). Potential AST $T'$ must be obtained by
grafting $a$ in $T_b$. Since $T'$ must display $((x, a), y)$, there
is only one possibility of grafting $a$ in $T_b$: $a$ should be
grafted on the edge $(p_x, x)$. Thus, $T'$ exists. Considering
$\psi(x) < \psi(a)$, the canonical topology for AST $T'$ is shown
in Figure 18(d).

(e) $T_b$ corresponds to the topology in Figure 17(e) (replace $z$
with $b$ in tree $T$). Thus, $\psi(x) < \psi(b)$. Since both $T_a$
and $T_b$ correspond to the topology in Figure 17(e), without loss
of generality, let $\psi(a) < \psi(b)$. Potential AST $T'$ must be
obtained by grafting $a$ in $T_b$ and must display $((x, a), y)$.
Since $\psi(x) < \psi(a) < \psi(b)$, there are four possible
canonical topologies for $T'$; see Figure 19(a). However,
$[y, a, b]$ is an agreement triplet; thus, only one of the four
possible topologies exists across all input trees. Thus, $T'$
exists.

(f) $T_b$ corresponds to the topology in Figure 17(g) (replace $z$
with $b$ in tree $T$). Potential AST $T'$ must be obtained by
grafting $a$ in $T_b$. Since $T'$ must display $((x, a), y)$, there
is only one possibility of grafting $a$ in $T_b$: $a$ should be
grafted on the edge $(p_x, x)$. Thus, $T'$ exists. Considering
$\psi(x) < \psi(a)$, the canonical topology for AST $T'$ is shown
in Figure 19(b).

(g) $T_b$ corresponds to the topology in Figure 17(h) (replace $z$
with $b$ in tree $T$). Potential AST $T'$ must be obtained by
grafting $a$ in $T_b$. Since $T'$ must display $((x, a), y)$, there
is only one possibility of grafting $a$ in $T_b$: $a$ should be
grafted on the edge $(p_x, x)$. Thus, $T'$ exists. Considering
$\psi(x) < \psi(a)$, the canonical topology for AST $T'$ is shown
in Figure 19(c).

Thus, in each of the above cases $T'$ exists. Thus, $[x, a, b]$ is
an agreement triplet.

This completes the proof. □ 